Ryan, I asked this question a couple of days ago in a slightly different form. What you have to do is make sure the table you're joining is smaller than the leftmost table. (If you can't reorder the query itself, there are also join hints for this; see the sketch below.)
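This is a minimal sketch with made-up table names, not our production query, and it assumes a Hive build recent enough to support the STREAMTABLE and MAPJOIN hints, so check your version before relying on it:

SELECT /*+ STREAMTABLE(big) */ COUNT(1)
FROM some_big_table big
JOIN some_smaller_table small ON (small.big_id = big.id);

-- STREAMTABLE tells the planner to stream the named table through the
-- reducers instead of buffering it in memory.

-- When one side is tiny (like the 10-row table you mention), MAPJOIN
-- loads it into memory on the mappers and skips the reduce phase entirely:
SELECT /*+ MAPJOIN(tiny) */ COUNT(1)
FROM some_big_table big
JOIN some_tiny_table tiny ON (tiny.big_id = big.id);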
As an example, here is a join that works for us:

SELECT COUNT(DISTINCT UT.UserID)
FROM usertracking UT
JOIN streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid)
WHERE UT.UserID IS NOT NULL
  AND UT.UserID <> 0;

In this query, usertracking is a table of about 8 or 9 gigs and streamtransfers is about 4 gigs. Per Zheng's comment, I filtered out UserIDs that are NULL or zero, since a huge number of rows share those keys; with that filter in place the join worked as intended.

Chris

PS. As an aside, Hive is proving to be quite useful to all of our database hackers here at Grooveshark. Thanks to everyone who has committed...I hope to contribute soon.

On Mon, Oct 26, 2009 at 2:08 AM, Zheng Shao <[email protected]> wrote:

> It's probably caused by the Cartesian product of many rows from the two
> tables with the same key.
>
> Zheng
>
> On Sun, Oct 25, 2009 at 7:22 PM, Ryan LeCompte <[email protected]> wrote:
>
>> It also looks like the reducers just never stop outputting things like
>> the following, causing them to ultimately time out and get killed by
>> the system:
>>
>> 2009-10-25 22:21:18,879 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100000000 rows
>> 2009-10-25 22:21:22,009 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 101000000 rows
>> 2009-10-25 22:21:22,010 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 101000000 rows
>> 2009-10-25 22:21:25,141 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 102000000 rows
>> 2009-10-25 22:21:25,142 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 102000000 rows
>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 103000000 rows
>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 103000000 rows
>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104000000 rows
>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 104000000 rows
>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105000000 rows
>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 105000000 rows
>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106000000 rows
>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 106000000 rows
>>
>> On Sun, Oct 25, 2009 at 9:39 PM, Ryan LeCompte <[email protected]> wrote:
>>
>>> Hello all,
>>>
>>> Should I expect to be able to do a Hive JOIN between two tables that
>>> have about 10 or 15 GB of data each? What I'm noticing (for a simple
>>> JOIN) is that all the map tasks complete, but the reducers just hang
>>> at around 87% or so (for the first set of 4 reducers), and then they
>>> eventually get killed by the cluster for failing to respond. I can do
>>> a JOIN between a large table and a very small table of 10 or so
>>> records just fine.
>>>
>>> Any thoughts?
>>>
>>> Thanks,
>>> Ryan
>
> --
> Yours,
> Zheng
