Hi everyone, 

Yesterday I started looking into Hadoop while trying to understand
Mahout's FPGrowth algorithm.

I have a few questions. I have two tables: one lists the items that were
viewed in each session, and the other lists the items that were bought:

ItemsViewed Table: 
Session, Item
1, P1
1, P2
1, P3
1, P4

2, P2
2, P4
2, P6

ItemsBought Table: 
Session, Item
1, P2
1, P3
2, P2
2, P4

Within each session, I want to pair every viewed item with every bought item
and count how often each <viewed, bought> pair occurs across all sessions:
<P1, P2> 1 
<P1, P3> 1

<P2, P2> 2
<P2, P3> 1
<P2, P4> 1

<P3, P2> 1
<P3, P3> 1

<P4, P2> 2
<P4, P3> 1
<P4, P4> 1

<P6, P2> 1
<P6, P4> 1
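
To make the expected counts concrete, here is a minimal in-memory Python sketch of the pairing logic. It is not Hadoop code: the function names are made up for illustration, and grouping by session only stands in for what a shuffle keyed on session would do.

```python
from collections import Counter, defaultdict
from itertools import product

# Sample rows from the two tables above, as (session, item) tuples.
items_viewed = [(1, "P1"), (1, "P2"), (1, "P3"), (1, "P4"),
                (2, "P2"), (2, "P4"), (2, "P6")]
items_bought = [(1, "P2"), (1, "P3"), (2, "P2"), (2, "P4")]


def group_by_session(rows):
    """Collect the items of each session (stand-in for a shuffle on session)."""
    grouped = defaultdict(list)
    for session, item in rows:
        grouped[session].append(item)
    return grouped


def count_pairs(viewed_rows, bought_rows):
    """Count <viewed, bought> pairs whose items share a session."""
    viewed = group_by_session(viewed_rows)
    bought = group_by_session(bought_rows)
    counts = Counter()
    for session, viewed_items in viewed.items():
        # A session with no purchases contributes no pairs.
        for pair in product(viewed_items, bought.get(session, [])):
            counts[pair] += 1
    return counts


pair_counts = count_pairs(items_viewed, items_bought)
for (viewed_item, bought_item), n in sorted(pair_counts.items()):
    print(f"<{viewed_item}, {bought_item}> {n}")
```

On the sample data this yields the 12 distinct pairs listed above.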

I'm currently doing this in the database: I join the two tables to generate
the pairs into a temp table, then run a merge to aggregate the results, which
could potentially number in the hundreds of millions. I'm thinking about using
Hadoop's MapReduce to achieve the same.
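
For reference, the database approach boils down to something like the sketch below. It is a simplification under stated assumptions: SQLite (via Python's sqlite3 module) stands in for the real database, the column names are taken from the table headers above, and the temp table plus merge is collapsed into a single join with GROUP BY.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real one (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ItemsViewed (session INTEGER, item TEXT);
    CREATE TABLE ItemsBought (session INTEGER, item TEXT);
""")
conn.executemany(
    "INSERT INTO ItemsViewed VALUES (?, ?)",
    [(1, "P1"), (1, "P2"), (1, "P3"), (1, "P4"),
     (2, "P2"), (2, "P4"), (2, "P6")],
)
conn.executemany(
    "INSERT INTO ItemsBought VALUES (?, ?)",
    [(1, "P2"), (1, "P3"), (2, "P2"), (2, "P4")],
)

# Join on session to generate the pairs, then aggregate per pair.
rows = conn.execute("""
    SELECT v.item, b.item, COUNT(*) AS pair_count
    FROM ItemsViewed AS v
    JOIN ItemsBought AS b ON b.session = v.session
    GROUP BY v.item, b.item
    ORDER BY v.item, b.item
""").fetchall()

for viewed_item, bought_item, pair_count in rows:
    print(f"<{viewed_item}, {bought_item}> {pair_count}")
```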

As noted above, the original data resides in the database. I'd like to
distribute the work by session: for each session, query the database for the
items viewed and the items bought in that session, then count the pairs. How do
I do that? I've noticed there is a DBConfiguration and a DBInputFormat, but I
couldn't find much detail on them. I also need to access both tables in order
to generate the pairs and count them.

Next, once the pairs are generated, I'd like to store the final result,
containing all pairs whose count is greater than a specified threshold, back
into the database.
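
The threshold step I have in mind looks roughly like the reducer-style sketch below. It is plain Python rather than Hadoop code: reduce_pairs and the sample mapper output are hypothetical, sorting stands in for the shuffle/sort phase, and writing the result back to the database is left out.

```python
from itertools import groupby
from operator import itemgetter


def reduce_pairs(emitted, threshold):
    """Sum the counts per pair and keep pairs whose total exceeds threshold."""
    kept = []
    # Sorting stands in for the shuffle/sort phase that groups equal keys.
    ordered = sorted(emitted, key=itemgetter(0))
    for pair, group in groupby(ordered, key=itemgetter(0)):
        total = sum(count for _, count in group)
        if total > threshold:
            kept.append((pair, total))
    return kept


# Hypothetical mapper output: one (pair, 1) record per co-occurrence.
emitted = [
    (("P2", "P2"), 1), (("P4", "P2"), 1), (("P2", "P4"), 1),
    (("P2", "P2"), 1), (("P4", "P2"), 1), (("P6", "P4"), 1),
]

frequent = reduce_pairs(emitted, threshold=1)
for (viewed_item, bought_item), total in frequent:
    print(f"<{viewed_item}, {bought_item}> {total}")
```

With a threshold of 1, only the pairs counted more than once survive.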

Any pointers/recommendations would be great. Thanks.

Sebastian
