Thanks guys, very useful information. I'll modify my query a bit and report back on whether it worked.
Thanks,
Ryan

On Mon, Oct 26, 2009 at 4:34 AM, Chris Bates <[email protected]> wrote:

> Ryan,
>
> I asked this question a couple of days ago in a slightly different form.
> What you have to do is make sure the table you're joining is smaller than
> the leftmost table. As an example:
>
> SELECT COUNT(DISTINCT UT.UserID)
> FROM usertracking UT
> JOIN streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid)
> WHERE UT.UserID IS NOT NULL AND UT.UserID <> 0;
>
> In this query, usertracking is a table of about 8 or 9 GB, and
> streamtransfers is about 4 GB. Per Zheng's comment, I excluded UserIDs of
> NULL or zero, since many rows share those keys, and the join worked as
> intended.
>
> Chris
>
> PS. As an aside, Hive is proving to be quite useful to all of our
> database hackers here at Grooveshark. Thanks to everyone who has
> committed; I hope to contribute soon.
>
> On Mon, Oct 26, 2009 at 2:08 AM, Zheng Shao <[email protected]> wrote:
>
>> It's probably caused by the Cartesian product of the many rows from the
>> two tables that share the same key.
>>
>> Zheng
>>
>> On Sun, Oct 25, 2009 at 7:22 PM, Ryan LeCompte <[email protected]> wrote:
>>
>>> It also looks like the reducers just never stop outputting lines like
>>> the following, causing them to ultimately time out and get killed by
>>> the system:
>>>
>>> 2009-10-25 22:21:18,879 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100000000 rows
>>> 2009-10-25 22:21:22,009 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 101000000 rows
>>> 2009-10-25 22:21:22,010 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 101000000 rows
>>> 2009-10-25 22:21:25,141 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 102000000 rows
>>> 2009-10-25 22:21:25,142 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 102000000 rows
>>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 103000000 rows
>>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 103000000 rows
>>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104000000 rows
>>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 104000000 rows
>>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105000000 rows
>>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 105000000 rows
>>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106000000 rows
>>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 106000000 rows
>>>
>>> On Sun, Oct 25, 2009 at 9:39 PM, Ryan LeCompte <[email protected]> wrote:
>>>
>>>> Hello all,
>>>>
>>>> Should I expect to be able to do a Hive JOIN between two tables that
>>>> have about 10 or 15 GB of data each? What I'm noticing (for a simple
>>>> JOIN) is that all the map tasks complete, but the reducers just hang
>>>> at around 87% or so (for the first set of 4 reducers), and then they
>>>> eventually get killed by the cluster for failing to respond. I can do
>>>> a JOIN between a large table and a very small table of 10 or so
>>>> records just fine.
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks,
>>>> Ryan
>>
>> --
>> Yours,
>> Zheng
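A quick way to confirm the hot-key skew Zheng describes is to count rows per join key on each side before running the join. Below is a minimal sketch using the usertracking table and the UserID column Chris found skewed; the 100000 cutoff is an arbitrary illustration, not a recommended value.

-- Count rows per candidate join key; any key appearing millions of times
-- on both sides of a join yields roughly count * count output rows, which
-- matches the runaway "forwarding N rows" pattern in the logs above.
-- The subquery stands in for HAVING, which early Hive releases lack.
SELECT t.UserID, t.cnt
FROM (
  SELECT UserID, COUNT(1) AS cnt
  FROM usertracking
  GROUP BY UserID
) t
WHERE t.cnt > 100000;

NULL and zero keys, the ones Chris filtered out, will usually top this list.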

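When one side of the join really is small, as in Ryan's 10-row case, Hive releases of this era also offer a map-side join, which performs the join in the mappers so no reducer has to buffer a skewed key. A hedged sketch, reusing Chris's usertracking table and a hypothetical small lookup table named premium_users; check your Hive version for MAPJOIN hint support.

-- premium_users is a hypothetical small table, loaded into memory on each
-- mapper; usertracking is streamed past it, so the join never shuffles the
-- skewed keys to a reducer (the COUNT still uses a cheap reduce step).
SELECT /*+ MAPJOIN(p) */ COUNT(DISTINCT ut.UserID)
FROM usertracking ut
JOIN premium_users p ON (p.UserID = ut.UserID)
WHERE ut.UserID IS NOT NULL AND ut.UserID <> 0;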