Thanks guys, very useful information. I'll modify my query a bit and report back on whether it worked.
Thanks,
Ryan

On Mon, Oct 26, 2009 at 4:34 AM, Chris Bates <[email protected]> wrote:

> Ryan,
>
> I asked this question a couple of days ago in a slightly different form.
> What you have to do is make sure the table you're joining is smaller than
> the leftmost table. As an example:
>
> SELECT COUNT(DISTINCT UT.UserID)
> FROM usertracking UT
> JOIN streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid)
> WHERE UT.UserID IS NOT NULL AND UT.UserID <> 0;
>
> In this query, usertracking is a table of about 8 or 9 GB, and
> streamtransfers is about 4 GB. Per Zheng's comment, I excluded UserIDs of
> NULL or zero, since many rows share those keys, and the join worked as
> intended.
>
> Chris
>
> PS. As an aside, Hive is proving to be quite useful to all of our
> database hackers here at Grooveshark. Thanks to everyone who has
> committed; I hope to contribute soon.
>
> On Mon, Oct 26, 2009 at 2:08 AM, Zheng Shao <[email protected]> wrote:
>
>> It's probably caused by the Cartesian product of the many rows from the
>> two tables that share the same key.
>>
>> Zheng
>>
>> On Sun, Oct 25, 2009 at 7:22 PM, Ryan LeCompte <[email protected]> wrote:
>>
>>> It also looks like the reducers just never stop outputting lines like
>>> the following, causing them to ultimately time out and get killed by
>>> the system:
>>>
>>> 2009-10-25 22:21:18,879 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100000000 rows
>>> 2009-10-25 22:21:22,009 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 101000000 rows
>>> 2009-10-25 22:21:22,010 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 101000000 rows
>>> 2009-10-25 22:21:25,141 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 102000000 rows
>>> 2009-10-25 22:21:25,142 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 102000000 rows
>>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 103000000 rows
>>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 103000000 rows
>>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104000000 rows
>>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 104000000 rows
>>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105000000 rows
>>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 105000000 rows
>>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106000000 rows
>>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 106000000 rows
>>>
>>> On Sun, Oct 25, 2009 at 9:39 PM, Ryan LeCompte <[email protected]> wrote:
>>>
>>>> Hello all,
>>>>
>>>> Should I expect to be able to do a Hive JOIN between two tables that
>>>> have about 10 or 15 GB of data each? What I'm noticing (for a simple
>>>> JOIN) is that all the map tasks complete, but the reducers just hang
>>>> at around 87% or so (for the first set of 4 reducers), and then they
>>>> eventually get killed by the cluster for failing to respond. I can do
>>>> a JOIN between a large table and a very small table of 10 or so
>>>> records just fine.
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks,
>>>> Ryan
>>
>> --
>> Yours,
>> Zheng
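A quick way to confirm the hot-key skew Zheng describes is to count rows per join key on each side before running the join. Below is a minimal sketch using the usertracking table and the UserID column Chris found skewed; the 100000 cutoff is an arbitrary illustration, not a recommended value.

-- Count rows per candidate join key; any key appearing millions of times
-- on both sides of a join yields roughly count * count output rows, which
-- matches the runaway "forwarding N rows" pattern in the logs above.
-- The subquery stands in for HAVING, which early Hive releases lack.
SELECT t.UserID, t.cnt
FROM (
  SELECT UserID, COUNT(1) AS cnt
  FROM usertracking
  GROUP BY UserID
) t
WHERE t.cnt > 100000;

NULL and zero keys, the ones Chris filtered out, will usually top this list.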

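When one side of the join really is small, as in Ryan's 10-row case, Hive releases of this era also offer a map-side join, which performs the join in the mappers so no reducer has to buffer a skewed key. A hedged sketch, reusing Chris's usertracking table and a hypothetical small lookup table named premium_users; check your Hive version for MAPJOIN hint support.

-- premium_users is a hypothetical small table, loaded into memory on each
-- mapper; usertracking is streamed past it, so the join never shuffles the
-- skewed keys to a reducer (the COUNT still uses a cheap reduce step).
SELECT /*+ MAPJOIN(p) */ COUNT(DISTINCT ut.UserID)
FROM usertracking ut
JOIN premium_users p ON (p.UserID = ut.UserID)
WHERE ut.UserID IS NOT NULL AND ut.UserID <> 0;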