Check out Pig. You can write SQL-like MapReduce jobs with it. That's the best answer I have.
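To make the idea concrete, here is a minimal sketch in plain Python of how the pair counting from this thread could map onto map/shuffle/reduce steps. It is not Pig or Hadoop code, and it runs in memory only. The sample rows, the "A"/"B" source tags, and the names `map_phase`, `shuffle`, and `count_pairs` are all made up for illustration. On a real cluster the per-session join and the per-pair count would more likely be two separate jobs.

```python
from collections import defaultdict
from itertools import product

# Hypothetical (session, item) rows; NOT the data set from the thread.
SPACE1 = [("s1", 1), ("s1", 2), ("s2", 2)]  # e.g. viewed items
SPACE2 = [("s1", 2), ("s1", 3), ("s2", 2)]  # e.g. bought items


def map_phase(space1, space2):
    """Emit (session, (source_tag, item)) so rows from both tables meet by session."""
    for session, item in space1:
        yield session, ("A", item)
    for session, item in space2:
        yield session, ("B", item)


def shuffle(records):
    """Group values by key, standing in for the framework's shuffle/sort step."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def count_pairs(space1, space2, threshold=0):
    """Count (act, rec) pairs sharing a session; keep counts greater than threshold."""
    counts = defaultdict(int)
    for values in shuffle(map_phase(space1, space2)).values():
        acts = [item for tag, item in values if tag == "A"]
        recs = [item for tag, item in values if tag == "B"]
        # The cross product within one session plays the role of the join.
        for pair in product(acts, recs):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n > threshold}


pair_counts = count_pairs(SPACE1, SPACE2)
frequent_pairs = count_pairs(SPACE1, SPACE2, threshold=1)
```

With the made-up rows above, `frequent_pairs` keeps only the pair (2, 2), which occurs in both sessions. Grouping by session first keeps the cross product small, since only items from the same session are ever paired.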
On Sat, Apr 24, 2010 at 12:27 AM, Sebastian Feher <[email protected]> wrote:

> Hi Robin,
>
> Thanks for your answer. Yes, I do understand that FPGrowth gives you the
> most frequent co-occurrences and that some of the more interesting ones are
> not pairs (not to say that pairs are not interesting). However, this is not
> what I want in this case. I need all the pairs for a given active item that
> co-occur with the active item more times than a threshold.
> FPGrowth gives me that but also much more, so I'm trying to find a simpler
> algorithm that generates only the pairs. I need to process billions of
> data points, so performance and scalability are important. I'm also trying to
> understand the technologies involved, so please bear with me :)
>
> Currently, I can run a simple (DB2) SQL query on the data set I
> mentioned earlier and get the occurrence count.
>
> SELECT SPACE1.ITEM AS ACT, SPACE2.ITEM AS REC, count(*) as COUNT FROM
> SPACE1, SPACE2 where space1.session=space2.session group by SPACE1.ITEM,
> SPACE2.ITEM;
>
> ACT  REC  COUNT
> 1    2    1
> 1    3    1
> 2    2    2
> 2    3    1
> 2    4    1
> 3    2    1
> 3    3    1
> 4    2    2
> 4    3    1
> 4    4    1
> 6    2    1
> 6    4    1
>
> This gives me the right occurrence count. I was able to run these types
> of queries successfully on batches of a few million data points and merge the
> results pretty fast. I want to understand how to implement the equivalent in
> Hadoop. Hopefully this makes more sense.
>
> Sebastian
>
> ------------------------------
> *From:* Robin Anil <[email protected]>
> *To:* [email protected]
> *Sent:* Fri, April 23, 2010 11:16:59 AM
> *Subject:* Re: counting pairs of items across item types
>
> Hi Sebastian, Let me get your use case right. You want to do pair
> counting like a join. You might need to use Pig or something similar to do
> this easily. Mahout's PFPGrowth counts the co-occurring, frequent n-items,
> not just co-occurrence of two items.
> There you just need either the viewed or the bought transaction table to
> generate these patterns.
>
> Robin
>
> On Fri, Apr 23, 2010 at 7:48 PM, Sebastian Feher <[email protected]> wrote:
>
>> There's a DBConfiguration and a DBInputFormat, but I couldn't find many
>> details on these. Also, I need to access both tables in order to generate the
>> pairs and count them.
>> Next, when generating the pairs, I'd like to store the final outcome,
>> containing all the pairs whose count is greater than a specified threshold,
>> back into the database.
