Thanks Sonal. Do you have an example of how to use your framework?
Also, a few other questions:
What do you mean by "It supports table joins"? I probably missed the meaning of this, as I need to understand more about how Hadoop works.
I've seen it mentioned that HIHO supports MySQL. How about other databases? Do they work as well?
Thanks,
Sebastian

________________________________
From: Sonal Goyal <[email protected]>
To: [email protected]
Sent: Fri, April 23, 2010 11:13:02 AM
Subject: Re: counting pairs of items across item types

Hi Sebastian,

You could use the HIHO framework for querying and extracting data from the database and getting it to Hadoop. It supports table joins.

More here:
http://code.google.com/p/hiho/

If you need any help, please feel free to contact me directly.

Thanks and Regards,
Sonal
www.meghsoft.com

On Fri, Apr 23, 2010 at 7:48 PM, Sebastian Feher <[email protected]> wrote:

>Hi everyone,
>
>Yesterday I started to look into Hadoop as I was trying to understand
>Mahout's FPGrowth algorithm.
>
>I have a few questions:
>Given that I have two tables, the first containing items that were
>viewed and the second containing items that were bought:
>
>ItemsViewed Table:
>Session, Item
>1, P1
>1, P2
>1, P3
>1, P4
>
>2, P2
>2, P4
>2, P6
>
>ItemsBought Table:
>Session, Item
>1, P2
>1, P3
>2, P2
>2, P4
>
>I'm trying to count the pairs of items that occur between these two tables:
><P1, P2> 1
><P1, P3> 1
>
><P2, P2> 2
><P2, P3> 1
><P2, P4> 1
>
><P3, P2> 1
><P3, P3> 1
>
><P4, P2> 2
><P4, P3> 1
><P4, P4> 1
>
><P6, P2> 1
><P6, P4> 1
>
>I'm currently doing this with a database approach (joining the two tables to
>generate the pairs into a temp table, followed by a merge to aggregate the
>results, which could potentially be in the 100's of millions) and thinking about
>using Hadoop's mapreduce to achieve the same.
>
>As noted above, the original information resides in the database. What I'd
>like is to distribute the work based on session and, for each session, query the
>database to retrieve the items associated with the session for both browse and
>purchase and count the pairs. How do I do that? I've noticed there's a
>DBConfiguration and a DBInputFormat but couldn't find many details on these.
>Also, I need to access both tables in order to generate the pairs and count them.
>
>Next, when generating the pairs, I'd like to store the final outcome,
>containing all the pairs whose count is greater than a specified threshold,
>back into the database.
>
>Any pointers/recommendations would be great. Thanks.
>
>Sebastian
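
For reference, the pair counting described in the original message can be sketched in a few lines of plain Python. This is only an in-memory illustration of the map / group-by-session / reduce idea, not Hadoop or HIHO code: it does not use DBInputFormat or any database, and the function names (map_phase, shuffle, reduce_session, count_pairs) and the threshold parameter are made up for the example. The sample rows are the ones from the two tables in the message.

```python
from collections import Counter, defaultdict
from itertools import product

# Sample rows from the original message: (session, item).
ITEMS_VIEWED = [(1, "P1"), (1, "P2"), (1, "P3"), (1, "P4"),
                (2, "P2"), (2, "P4"), (2, "P6")]
ITEMS_BOUGHT = [(1, "P2"), (1, "P3"), (2, "P2"), (2, "P4")]


def map_phase(viewed_rows, bought_rows):
    """Key every row by session and tag it with the table it came from."""
    for session, item in viewed_rows:
        yield session, ("viewed", item)
    for session, item in bought_rows:
        yield session, ("bought", item)


def shuffle(keyed_records):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in keyed_records:
        groups[key].append(value)
    return groups


def reduce_session(tagged_items):
    """Return every (viewed, bought) pair for a single session."""
    viewed = [item for tag, item in tagged_items if tag == "viewed"]
    bought = [item for tag, item in tagged_items if tag == "bought"]
    return product(viewed, bought)


def count_pairs(viewed_rows, bought_rows, threshold=0):
    """Count pairs across sessions; keep those with count > threshold."""
    counts = Counter()
    groups = shuffle(map_phase(viewed_rows, bought_rows))
    for tagged_items in groups.values():
        counts.update(reduce_session(tagged_items))
    return {pair: n for pair, n in counts.items() if n > threshold}


if __name__ == "__main__":
    for pair, n in sorted(count_pairs(ITEMS_VIEWED, ITEMS_BOUGHT).items()):
        print(pair, n)
```

On a real cluster the final aggregation would presumably be its own reduce step (or a second job) keyed by pair, since pairs from different sessions end up on different reducers; here a single Counter plays that role.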
