Hi Robin,

Thanks for your answer. Yes, I do understand that FPGrowth gives you the most frequent co-occurrences, and that some of the more interesting ones are not pairs (not to say that pairs are not interesting). However, this is not what I want in this case. I need all the pairs for a given active item that co-occur with that item more than a threshold number of times. FPGrowth gives me that but also much more, so I'm trying to find a simpler algorithm that only generates the pairs. I do need to process billions of data points, so performance and scalability are important. I'm also trying to understand the technologies involved, so please bear with me :)
Currently, I can run a simple (DB2) SQL query on the data set I mentioned earlier and get the occurrence count.

SELECT SPACE1.ITEM AS ACT, SPACE2.ITEM AS REC, COUNT(*) AS COUNT
FROM SPACE1, SPACE2
WHERE SPACE1.SESSION = SPACE2.SESSION
GROUP BY SPACE1.ITEM, SPACE2.ITEM;

ACT  REC  COUNT
1    2    1
1    3    1
2    2    2
2    3    1
2    4    1
3    2    1
3    3    1
4    2    2
4    3    1
4    4    1
6    2    1
6    4    1

This would give me the right occurrence count. I was able to run these types of queries successfully on batches of a few million data points and merge the results pretty fast. I want to understand how to implement the equivalent in Hadoop. Hopefully this makes more sense.

Sebastian

________________________________
From: Robin Anil <[email protected]>
To: [email protected]
Sent: Fri, April 23, 2010 11:16:59 AM
Subject: Re: counting pairs of items across item types

Hi Sebastian,

Let me get your use case right. You want to do a pair counting like a join. You might need to use PIG or something similar to do this easily. Mahout's PFPGrowth counts the co-occurring, frequent n-items, not just co-occurrence of two items. There you just need either one of the viewed or bought transaction tables to generate these patterns.

Robin

On Fri, Apr 23, 2010 at 7:48 PM, Sebastian Feher <[email protected]> wrote:
>
> There's a DBConfiguration and a DBInputFormat but I couldn't find many details on
> these. Also I need to access both tables in order to generate the pairs and
> count them.
> Next, when generating the pairs, I'd like to store the final outcome
> containing all the pairs whose count is greater than a specified threshold
> back into the database.
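
For reference, the SQL self-join in the message above could map onto a map/reduce-style computation roughly as follows. This is only a minimal in-memory sketch in plain Python, not a Hadoop job: the function name count_pairs, the (session, item) tuple layout, and the sample rows are all assumptions made for illustration and are not the data set from the thread. The idea is to key both tables by session, emit every (ACT, REC) combination within a session, sum the emissions per pair, and keep only pairs whose count exceeds the threshold.

```python
from collections import defaultdict
from itertools import product


def count_pairs(space1, space2, threshold=0):
    """Count (act, rec) pairs sharing a session; keep counts > threshold.

    space1 and space2 are iterables of (session, item) tuples
    (a hypothetical layout standing in for the SPACE1/SPACE2 tables).
    """
    # "Map" step: key every record by session, tagged by source table.
    by_session = defaultdict(lambda: ([], []))
    for session, item in space1:
        by_session[session][0].append(item)
    for session, item in space2:
        by_session[session][1].append(item)

    # "Reduce" step: within each session, emit every (act, rec)
    # combination and sum the emissions per pair.
    counts = defaultdict(int)
    for acts, recs in by_session.values():
        for act, rec in product(acts, recs):
            counts[(act, rec)] += 1

    # Final filter: keep only pairs seen more than `threshold` times.
    return {pair: n for pair, n in counts.items() if n > threshold}


# Hypothetical sample rows, not the data from the thread.
space1 = [("s1", 1), ("s1", 2), ("s2", 2), ("s3", 4)]
space2 = [("s1", 2), ("s2", 2), ("s2", 3), ("s3", 2)]

all_pairs = count_pairs(space1, space2)
frequent_pairs = count_pairs(space1, space2, threshold=1)
```

In a real cluster job the per-session grouping would be done by the shuffle, and the summing would likely be a second pass (or a combiner) keyed by the pair, since a single session's cross product is produced independently of the others.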
