Hi Ning,

Basically, I have roughly 60GB of log data that is logically broken up into
the two Hive tables I mentioned. Both tables share a common key, but this key
appears in virtually every row of each table. Each table has just one
partition (by date). My query is very similar (although with different
data/columns) to Chris Bates' query:

SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN streamtransfers ST
ON (ST.usertrackingid = UT.usertrackingid) WHERE UT.UserID IS NOT NULL AND
UT.UserID <> 0;

However, in the above example, imagine that in both tables the same users
appear a lot, so there are lots of matches.

On Mon, Oct 26, 2009 at 12:27 PM, Ning Zhang <[email protected]> wrote:

> If it is really a Cartesian product, there is no better way other than
> increasing the timeout for the reducers. You can do a back-of-the-envelope
> calculation on how long it takes (e.g., in your log it shows it takes 19
> sec. to get 6 million rows out of the join). This calculation can also be
> done if it is not a Cartesian product, and you have a good estimate of how
> many rows will be produced.
>
> In general, joining two huge tables can be avoided by partitioning the fact
> tables, so that you don't need to join the whole table.
>
> BTW, are you joining two fact tables, or is one dimension table just huge?
>
> Thanks,
> Ning
>
> On Oct 26, 2009, at 5:16 AM, Ryan LeCompte wrote:
>
> So it turns out that the JOIN key of my query basically results in a
> match/join on all rows of each table! There really is no extra filtering
> that I can do to exclude invalid rows, etc. The mappers fly by and complete,
> but the reducers are just moving extremely slowly (my guess due to what
> Zheng said about the Cartesian product of all rows getting matched).
>
> Is there some other way that I could re-write the JOIN, or is my only option
> to increase the timeout on the task trackers so that they don't timeout/kill
> the reducers? I've already upped their timeouts to 30 minutes (as opposed to
> the default of 10), and it doesn't seem to be sufficient... Again, this is
> joining a 33GB table with a 13GB table where the join key is shared by
> virtually all rows in both tables.
>
> Thanks,
> Ryan
>
> On Mon, Oct 26, 2009 at 7:35 AM, Ryan LeCompte <[email protected]> wrote:
>
>> Thanks guys, very useful information. I will modify my query a bit and get
>> back to you guys on whether it worked or not.
>>
>> Thanks,
>> Ryan
>>
>> On Mon, Oct 26, 2009 at 4:34 AM, Chris Bates <[email protected]> wrote:
>>
>>> Ryan,
>>>
>>> I asked this question a couple days ago, but in a slightly different form.
>>> What you have to do is make sure the table you're joining is smaller than
>>> the leftmost table. As an example,
>>>
>>> SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN
>>> streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid) WHERE
>>> UT.UserID IS NOT NULL AND UT.UserID <> 0;
>>>
>>> In this query, usertracking is a table that is about 8 or 9 gigs.
>>> Streamtransfers is a table that is about 4 gigs. As per Zheng's comment, I
>>> omitted UserIDs of Null or Zero, as there are many rows with this key, and
>>> the join worked as intended.
>>>
>>> Chris
>>>
>>> PS. As an aside, Hive is proving to be quite useful to all of our
>>> database hackers here at Grooveshark. Thanks to everyone who has
>>> committed... I hope to contribute soon.
>>>
>>> On Mon, Oct 26, 2009 at 2:08 AM, Zheng Shao <[email protected]> wrote:
>>>
>>>> It's probably caused by the Cartesian product of many rows from the two
>>>> tables with the same key.
>>>>
>>>> Zheng
>>>>
>>>> On Sun, Oct 25, 2009 at 7:22 PM, Ryan LeCompte <[email protected]> wrote:
>>>>
>>>>> It also looks like the reducers just never stop outputting things like
>>>>> the following (see below), causing them to ultimately time out and
>>>>> get killed by the system.
>>>>>
>>>>> 2009-10-25 22:21:18,879 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100000000 rows
>>>>> 2009-10-25 22:21:22,009 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 101000000 rows
>>>>> 2009-10-25 22:21:22,010 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 101000000 rows
>>>>> 2009-10-25 22:21:25,141 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 102000000 rows
>>>>> 2009-10-25 22:21:25,142 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 102000000 rows
>>>>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 103000000 rows
>>>>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 103000000 rows
>>>>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104000000 rows
>>>>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 104000000 rows
>>>>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105000000 rows
>>>>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 105000000 rows
>>>>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106000000 rows
>>>>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 106000000 rows
>>>>>
>>>>> On Sun, Oct 25, 2009 at 9:39 PM, Ryan LeCompte <[email protected]> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> Should I expect to be able to do a Hive JOIN between two tables that
>>>>>> have about 10 or 15GB of data each? What I'm noticing (for a simple JOIN)
>>>>>> is that all the map tasks complete, but the reducers just hang at around
>>>>>> 87% or so (for the first set of 4 reducers), and then they eventually just
>>>>>> get killed due to inability to respond by the cluster. I can do a JOIN
>>>>>> between a large table and a very small table of 10 or so records just fine.
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Yours,
>>>> Zheng
>>>
>>
>
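One way to avoid the per-key blow-up for a COUNT(DISTINCT) query like the ones
above is to collapse the right-hand table down to one row per join key before
joining, so that each usertracking row matches at most once rather than once
per duplicate. The sketch below reuses the table and column names from Chris's
example (usertracking, streamtransfers, usertrackingid, UserID) and is only
one possible rewrite, not what was actually run in the thread:

-- Deduplicate the join key on the streamtransfers side first, so the join
-- emits at most one row per qualifying usertracking row instead of one row
-- per (UT row, ST row) pair sharing the key.
SELECT COUNT(DISTINCT UT.UserID)
FROM usertracking UT
JOIN (
  SELECT DISTINCT usertrackingid
  FROM streamtransfers
) ST ON (ST.usertrackingid = UT.usertrackingid)
WHERE UT.UserID IS NOT NULL AND UT.UserID <> 0;

Because the outer query only counts distinct UserIDs, dropping duplicate
usertrackingid values on the streamtransfers side does not change the result;
it just turns an m x n match per key into m x 1.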

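Ning's back-of-the-envelope calculation can also be done up front with a cheap
aggregation: summing the per-key row-count products gives the number of rows
the big join will have to emit, and the log excerpt above suggests the reducer
was pushing roughly 300,000 rows per second (about 1,000,000 rows every ~3
seconds). Again, this is only a sketch using the example table and column
names from the thread:

-- Estimate the join output size: for each usertrackingid, the join emits
-- (rows in usertracking) x (rows in streamtransfers) for that key.
SELECT SUM(a.cnt * b.cnt) AS estimated_join_rows
FROM (
  SELECT usertrackingid, COUNT(*) AS cnt
  FROM usertracking
  GROUP BY usertrackingid
) a
JOIN (
  SELECT usertrackingid, COUNT(*) AS cnt
  FROM streamtransfers
  GROUP BY usertrackingid
) b ON (a.usertrackingid = b.usertrackingid);

Dividing that estimate by the observed throughput gives a rough reducer
runtime, which shows whether simply raising the task timeout further is
realistic or whether the join has to be restructured.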