Hi Robin,

Thanks for your answer. Yes, I do understand that FPGrowth gives you the most 
frequent co-occurrences and some of the more interesting ones are not pairs 
(not to say that pairs are not interesting). However, this is not what I want in 
this case. I need all the pairs for a given active item that co-occur with the 
active item a number of times greater than a threshold. FPGrowth gives me 
that but also much more, so I'm trying to find a simpler algorithm that only 
generates the pairs. I do need to process billions of data points, so 
performance and scalability are important. I'm also trying to understand the 
technologies involved, so please bear with me :)

Currently, I can run a simple (DB2) SQL query on the data set I've mentioned 
earlier and get the occurrence count.

SELECT SPACE1.ITEM AS ACT, SPACE2.ITEM AS REC, COUNT(*) AS COUNT
FROM SPACE1, SPACE2
WHERE SPACE1.SESSION = SPACE2.SESSION
GROUP BY SPACE1.ITEM, SPACE2.ITEM;

ACT  REC  COUNT
1    2    1
1    3    1
2    2    2
2    3    1
2    4    1
3    2    1
3    3    1
4    2    2
4    3    1
4    4    1
6    2    1
6    4    1

This gives me the right occurrence count. I was able to run these types of 
queries successfully on batches of a few million data points and merge the 
results pretty fast. I want to understand how to implement the equivalent in Hadoop. 
Hopefully this makes more sense. 
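
To make the goal concrete, here is a rough sketch of the logic I am after, in 
plain Python rather than Hadoop. The rows are made up for illustration (not the 
real data set), and the function names group_by_session and count_pairs are 
hypothetical. The idea is to group both tables by session, emit every (ACT, REC) 
combination within a session, then sum the counts and keep those above the 
threshold. I assume that in a MapReduce job the grouping would correspond to 
the shuffle on the session key and the summing to a second reduce step, but 
that mapping is my guess.

```python
from collections import Counter, defaultdict
from itertools import product


def group_by_session(rows):
    """Simulate the shuffle step: collect the items seen in each session."""
    sessions = defaultdict(list)
    for session, item in rows:
        sessions[session].append(item)
    return sessions


def count_pairs(space1_rows, space2_rows, threshold=0):
    """Count (ACT, REC) pairs sharing a session; keep counts > threshold."""
    acts = group_by_session(space1_rows)
    recs = group_by_session(space2_rows)

    counts = Counter()
    # Per session, emit every (act, rec) combination, like the SQL join.
    for session, act_items in acts.items():
        for pair in product(act_items, recs.get(session, [])):
            counts[pair] += 1

    # Equivalent of a HAVING COUNT(*) > threshold filter.
    return {pair: n for pair, n in counts.items() if n > threshold}


# Made-up sample rows of (session, item); assumptions for illustration only.
space1 = [(1, 1), (1, 2), (2, 2)]
space2 = [(1, 2), (1, 3), (2, 2)]

for (act, rec), n in sorted(count_pairs(space1, space2).items()):
    print(act, rec, n)
```

Calling count_pairs(space1, space2, threshold=1) on the same rows would keep 
only the pairs seen more than once.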

Sebastian



________________________________
From: Robin Anil <[email protected]>
To: [email protected]
Sent: Fri, April 23, 2010 11:16:59 AM
Subject: Re: counting pairs of items across item types

Hi Sebastian, let me get your use case right: you want to do pair counting 
like a join. You might need to use PIG or something similar to do this easily. 
Mahout's PFPGrowth counts co-occurring, frequent n-item sets, not just 
co-occurrences of two items. There you need only one of the viewed or 
bought transaction tables to generate these patterns. 


Robin


On Fri, Apr 23, 2010 at 7:48 PM, Sebastian Feher <[email protected]> wrote:

>
>There's a DBConfiguration and a DBInputFormat but I couldn't find many details on 
>these. Also, I need to access both tables in order to generate the pairs and 
>count them.
>Next, when generating the pairs, I'd like to store the final outcome 
>containing all the pairs whose count is greater than a specified threshold 
>back into the database. 

