Hi Sebastian,

You could use the HIHO framework for querying and extracting data from the database and getting it into Hadoop. It supports table joins. More here:
http://code.google.com/p/hiho/

If you need any help, please feel free to contact me directly.

Thanks and Regards,
Sonal
www.meghsoft.com

On Fri, Apr 23, 2010 at 7:48 PM, Sebastian Feher <[email protected]> wrote:
> Hi everyone,
>
> Yesterday I started looking into Hadoop while trying to understand
> Mahout's FPGrowth algorithm.
>
> I have a few questions. Suppose I have two tables, one containing items
> that were viewed and one containing items that were bought:
>
> ItemsViewed Table:
> Session, Item
> 1, P1
> 1, P2
> 1, P3
> 1, P4
>
> 2, P2
> 2, P4
> 2, P6
>
> ItemsBought Table:
> Session, Item
> 1, P2
> 1, P3
> 2, P2
> 2, P4
>
> I'm trying to count the (viewed, bought) item pairs that occur together in
> a session across these two tables:
>
> <P1, P2> 1
> <P1, P3> 1
>
> <P2, P2> 2
> <P2, P3> 1
> <P2, P4> 1
>
> <P3, P2> 1
> <P3, P3> 1
>
> <P4, P2> 2
> <P4, P3> 1
> <P4, P4> 1
>
> <P6, P2> 1
> <P6, P4> 1
>
> I'm currently doing this with a database approach (joining the two tables
> to generate the pairs into a temp table, followed by a merge to aggregate
> the results, which could potentially run into the hundreds of millions),
> and I'm thinking about using Hadoop's MapReduce to achieve the same.
>
> As noted above, the original information resides in the database. What I'd
> like is to distribute the work by session and, for each session, query the
> database to retrieve the items associated with that session for both browse
> and purchase, and count the pairs. How do I do that? I've noticed there are
> DBConfiguration and DBInputFormat classes, but I couldn't find much detail
> on them. I also need to access both tables in order to generate the pairs
> and count them.
> Next, when generating the pairs, I'd like to store the final outcome,
> containing all the pairs whose count is greater than a specified threshold,
> back into the database.
>
> Any pointers/recommendations would be great. Thanks.
>
> Sebastian
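
If you would rather stay with the plain DBConfiguration/DBInputFormat route you mention, below is a rough, untested sketch of a single job that pushes the session join into SQL, counts the (viewed, bought) pairs in MapReduce, and writes the pairs above a threshold back with DBOutputFormat. The JDBC driver/URL/credentials, the output table PairCount(viewed_item, bought_item, cnt), and the pair.count.threshold property are placeholders I made up, and it is written against the new org.apache.hadoop.mapreduce API, so treat it as a starting point rather than working code:

import java.io.IOException;
import java.sql.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.db.*;

public class ViewedBoughtPairs {

  // One row of the viewed/bought join read by DBInputFormat.
  public static class JoinRow implements DBWritable {
    String viewed, bought;
    public void readFields(ResultSet rs) throws SQLException {
      viewed = rs.getString(1);   // v.Item
      bought = rs.getString(2);   // b.Item
    }
    public void write(PreparedStatement ps) throws SQLException { } // input only
  }

  // One row written back by DBOutputFormat into the (assumed) PairCount table.
  public static class PairCount implements DBWritable {
    String viewed, bought; long count;
    PairCount() { }
    PairCount(String v, String b, long c) { viewed = v; bought = b; count = c; }
    public void write(PreparedStatement ps) throws SQLException {
      ps.setString(1, viewed); ps.setString(2, bought); ps.setLong(3, count);
    }
    public void readFields(ResultSet rs) throws SQLException { } // output only
  }

  // Emit ("viewedItem<TAB>boughtItem", 1) for every joined row.
  public static class PairMapper
      extends Mapper<LongWritable, JoinRow, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    protected void map(LongWritable key, JoinRow row, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(row.viewed + "\t" + row.bought), ONE);
    }
  }

  // Sum per pair and keep only pairs at or above the threshold.
  public static class PairReducer
      extends Reducer<Text, LongWritable, PairCount, NullWritable> {
    protected void reduce(Text pair, Iterable<LongWritable> ones, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable one : ones) sum += one.get();
      // "pair.count.threshold" is a made-up property name; set it when submitting the job.
      if (sum >= ctx.getConfiguration().getLong("pair.count.threshold", 2)) {
        String[] items = pair.toString().split("\t");
        ctx.write(new PairCount(items[0], items[1], sum), NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Driver, URL, and credentials are placeholders.
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/shop", "user", "password");
    Job job = new Job(conf, "viewed-bought pair counts");
    job.setJarByClass(ViewedBoughtPairs.class);
    job.setMapperClass(PairMapper.class);
    job.setReducerClass(PairReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);
    job.setOutputKeyClass(PairCount.class);
    job.setOutputValueClass(NullWritable.class);
    // The session join is pushed into SQL; the ORDER BY keeps the LIMIT/OFFSET
    // splits that DBInputFormat generates from overlapping.
    DBInputFormat.setInput(job, JoinRow.class,
        "SELECT v.Item, b.Item FROM ItemsViewed v JOIN ItemsBought b"
            + " ON v.Session = b.Session ORDER BY v.Session, v.Item, b.Item",
        "SELECT COUNT(*) FROM ItemsViewed v JOIN ItemsBought b ON v.Session = b.Session");
    // Assumed output table: PairCount(viewed_item, bought_item, cnt).
    DBOutputFormat.setOutput(job, "PairCount", "viewed_item", "bought_item", "cnt");
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The trade-off is that the join still runs inside the database; if that join output is too heavy for the database, you can instead export both tables to HDFS (for example with HIHO), do the join in a reducer keyed on Session, and aggregate the pair counts in a second job.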
