Check out PIG. You can write SQL-like MapReduce jobs with it. That's the best
answer I have.
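
For illustration, the join-and-count in the quoted SQL query can be thought of as a map step (key every row by session) followed by a reduce step (emit and count the item pairs per session). The following is only a minimal in-memory Python sketch of that idea, not Pig or Hadoop code. The function name `count_pairs`, the `(session, item)` row layout, and the toy rows are assumptions made up for the example; they are not the data set discussed in the thread.

```python
from collections import Counter, defaultdict
from itertools import product


def count_pairs(space1, space2, threshold=0):
    """Count (item1, item2) pairs that share a session.

    space1, space2: iterables of (session, item) rows (assumed layout).
    Returns only the pairs whose count is greater than `threshold`.
    """
    # "Map" step: key every row by session, remembering its source table.
    by_session = defaultdict(lambda: ([], []))
    for session, item in space1:
        by_session[session][0].append(item)
    for session, item in space2:
        by_session[session][1].append(item)

    # "Reduce" step: for each session, emit the cross product of the two
    # item lists and add it to the global pair counts.
    counts = Counter()
    for left_items, right_items in by_session.values():
        counts.update(product(left_items, right_items))

    # Keep only the pairs above the threshold.
    return {pair: n for pair, n in counts.items() if n > threshold}


# Hypothetical toy rows, invented for this sketch.
space1 = [(1, "a"), (1, "b"), (2, "a")]
space2 = [(1, "x"), (2, "x"), (2, "y")]
print(count_pairs(space1, space2, threshold=1))  # {('a', 'x'): 2}
```

In a real Hadoop job the per-session grouping would be done by the shuffle, and the final count-and-filter would likely be a second job keyed by the item pair; the sketch collapses both into one process for readability.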


On Sat, Apr 24, 2010 at 12:27 AM, Sebastian Feher <[email protected]> wrote:

> Hi Robin,
>
> Thanks for your answer. Yes, I do understand that FPGrowth gives you the
> most frequent co-occurrences, and that some of the more interesting ones are
> not pairs (not to say that pairs are not interesting). However, this is not
> what I want in this case. I need all the pairs for a given active item whose
> co-occurrence count is greater than a threshold. FPGrowth gives me that but
> also much more, so I'm trying to find a simpler algorithm that only
> generates the pairs. I do need to process billions of data points, so
> performance and scalability are important. I'm also trying to understand the
> technologies involved, so please bear with me :)
>
> Currently, I can run a simple (DB2) SQL query on the data set I've
> mentioned earlier and get the occurrence count.
>
> SELECT SPACE1.ITEM AS ACT, SPACE2.ITEM AS REC, COUNT(*) AS COUNT
> FROM SPACE1, SPACE2
> WHERE SPACE1.SESSION = SPACE2.SESSION
> GROUP BY SPACE1.ITEM, SPACE2.ITEM;
>
> ACT REC COUNT
> 1 2 1
> 1 3 1
> 2 2 2
> 2 3 1
> 2 4 1
> 3 2 1
> 3 3 1
> 4 2 2
> 4 3 1
> 4 4 1
> 6 2 1
> 6 4 1
>
> This gives me the right occurrence counts. I was able to run these types
> of queries successfully on batches of a few million data points and merge
> the results pretty fast. I want to understand how to implement the
> equivalent in Hadoop. Hopefully this makes more sense.
>
> Sebastian
>
> ------------------------------
> *From:* Robin Anil <[email protected]>
> *To:* [email protected]
> *Sent:* Fri, April 23, 2010 11:16:59 AM
> *Subject:* Re: counting pairs of items across item types
>
> Hi Sebastian, let me get your use case right: you want to do pair
> counting, like a join. You might need to use PIG or something similar to do
> this easily. Mahout's PFPGrowth counts co-occurring, frequent sets of n
> items, not just co-occurrences of two items. There you just need one of the
> viewed or bought transaction tables to generate these patterns.
>
> Robin
>
> On Fri, Apr 23, 2010 at 7:48 PM, Sebastian Feher <[email protected]> wrote:
>
>> There's a DBConfiguration and a DBInputFormat, but I couldn't find many
>> details on these. Also, I need to access both tables in order to generate
>> the pairs and count them.
>> Next, when generating the pairs, I'd like to store the final outcome,
>> containing all the pairs whose count is greater than a specified threshold,
>> back into the database.
>>
>
>
>
