Well, it looks like the JOIN query just isn't working out for me. I even
bumped the number of reducers up to 100 (as opposed to the 38 that Hive
picked automatically), and the reducers still have too much data to deal
with (due to the Cartesian product explosion).
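
For reference, bumping the reducer count can be done per-session from the
Hive CLI (mapred.reduce.tasks is the underlying Hadoop property):

  -- override Hive's estimated reducer count for this session
  SET mapred.reduce.tasks=100;
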
So it looks like I can rewrite the JOIN query using a UNION ALL and a
sub-query, which produces the desired result in a reasonable amount of time.
Going with Chris Bates' earlier example, here is what the JOIN looks like:
SELECT COUNT(DISTINCT UT.UserID)
FROM usertracking UT
JOIN streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid)
WHERE UT.UserID IS NOT NULL AND UT.UserID <> 0;
And here is what the new query based on UNION ALL looks like:
SELECT COUNT(DISTINCT usertrackingid) FROM
(SELECT usertrackingid
FROM usertracking_actions GROUP BY usertrackingid
UNION ALL
SELECT usertrackingid
FROM streamtransfers GROUP BY usertrackingid
) q1;
The above query (similar to mine) produces 3 map/reduce jobs.
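
If you want to double-check the stage breakdown on your side, prefixing the
query with EXPLAIN will print the plan and the individual map/reduce stages:

  EXPLAIN
  SELECT COUNT(DISTINCT usertrackingid) FROM
  (SELECT usertrackingid
   FROM usertracking_actions GROUP BY usertrackingid
   UNION ALL
   SELECT usertrackingid
   FROM streamtransfers GROUP BY usertrackingid
  ) q1;
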
My guess is that the JOIN query would run faster than the second query -- it
would be nice if you could try it out and compare performance, Chris.
Thanks,
Ryan
On Mon, Oct 26, 2009 at 8:16 AM, Ryan LeCompte <[email protected]> wrote:
> So it turns out that the JOIN key of my query basically results in a
> match/join on all rows of each table! There really is no extra filtering
> that I can do to exclude invalid rows, etc. The mappers fly by and complete,
> but the reducers are just moving extremely slowly (my guess is that it's due
> to what Zheng said about the Cartesian product of all the matched rows).
>
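> In case it helps anyone diagnosing something similar, a quick way to confirm
> that a handful of keys dominate (just a sketch, using the table/column names
> from this thread) is to look at the most frequent join keys:
>
>   -- top 20 join keys by row count
>   SELECT usertrackingid, COUNT(*) AS cnt
>   FROM usertracking
>   GROUP BY usertrackingid
>   ORDER BY cnt DESC
>   LIMIT 20;
>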
> Is there some other way that I could rewrite the JOIN, or is my only option
> to increase the timeout on the task trackers so that they don't time out and
> kill the reducers? I've already upped the timeout to 30 minutes (as opposed
> to the default of 10), and it doesn't seem to be sufficient... Again, this
> is joining a 33GB table with a 13GB table where the join key is shared by
> virtually all rows in both tables.
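>
> For anyone curious, the setting in question is the standard Hadoop task
> timeout, which can be set per-session from the Hive CLI (the value is in
> milliseconds):
>
>   -- 30 minutes; the default is 600000 (10 minutes)
>   SET mapred.task.timeout=1800000;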
>
> Thanks,
> Ryan
>
>
>
> On Mon, Oct 26, 2009 at 7:35 AM, Ryan LeCompte <[email protected]> wrote:
>
>> Thanks guys, very useful information. I will modify my query a bit and get
>> back to you guys on whether it worked or not.
>>
>> Thanks,
>> Ryan
>>
>>
>>
>> On Mon, Oct 26, 2009 at 4:34 AM, Chris Bates <
>> [email protected]> wrote:
>>
>>> Ryan,
>>>
>>> I asked this question a couple of days ago, in a slightly different form.
>>> What you have to do is make sure the table you're joining is smaller than
>>> the leftmost table. As an example:
>>>
>>> SELECT COUNT(DISTINCT UT.UserID)
>>> FROM usertracking UT
>>> JOIN streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid)
>>> WHERE UT.UserID IS NOT NULL AND UT.UserID <> 0;
>>>
>>> In this query, usertracking is a table that is about 8 or 9 gigs, and
>>> streamtransfers is about 4 gigs. As per Zheng's comment, I omitted UserIDs
>>> that are NULL or zero, since there are many rows with those keys, and the
>>> join worked as intended.
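>>>
>>> If filtering in the WHERE clause isn't enough, another option (just a
>>> sketch, reusing the names above) is to drop the bad keys in a sub-query so
>>> they never reach the join:
>>>
>>>   SELECT COUNT(DISTINCT ut.UserID)
>>>   FROM (SELECT UserID, usertrackingid
>>>         FROM usertracking
>>>         WHERE UserID IS NOT NULL AND UserID <> 0) ut
>>>   JOIN streamtransfers st
>>>     ON (st.usertrackingid = ut.usertrackingid);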
>>>
>>> Chris
>>>
>>> PS. As an aside, Hive is proving to be quite useful to all of our
>>> database hackers here at Grooveshark. Thanks to everyone who has
>>> committed... I hope to contribute soon.
>>>
>>>
>>> On Mon, Oct 26, 2009 at 2:08 AM, Zheng Shao <[email protected]> wrote:
>>>
>>>> It's probably caused by the Cartesian product of many rows from the two
>>>> tables with the same key.
>>>>
>>>> Zheng
>>>>
>>>>
>>>> On Sun, Oct 25, 2009 at 7:22 PM, Ryan LeCompte <[email protected]> wrote:
>>>>
>>>>> It also looks like the reducers just never stop outputting lines like
>>>>> the following (see below), causing them to ultimately time out and get
>>>>> killed by the system.
>>>>>
>>>>> 2009-10-25 22:21:18,879 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100000000 rows
>>>>> 2009-10-25 22:21:22,009 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 101000000 rows
>>>>> 2009-10-25 22:21:22,010 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 101000000 rows
>>>>> 2009-10-25 22:21:25,141 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 102000000 rows
>>>>> 2009-10-25 22:21:25,142 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 102000000 rows
>>>>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 103000000 rows
>>>>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 103000000 rows
>>>>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104000000 rows
>>>>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 104000000 rows
>>>>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105000000 rows
>>>>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 105000000 rows
>>>>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106000000 rows
>>>>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 106000000 rows
>>>>>
>>>>> On Sun, Oct 25, 2009 at 9:39 PM, Ryan LeCompte <[email protected]> wrote:
>>>>>
>>>>>> Hello all,
>>>>>>
>>>>>> Should I expect to be able to do a Hive JOIN between two tables that
>>>>>> have about 10 or 15GB of data each? What I'm noticing (for a simple
>>>>>> JOIN) is that all the map tasks complete, but the reducers just hang at
>>>>>> around 87% or so (for the first set of 4 reducers), and then they
>>>>>> eventually get killed by the cluster for failing to respond. I can do a
>>>>>> JOIN between a large table and a very small table of 10 or so records
>>>>>> just fine.
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Yours,
>>>> Zheng
>>>>
>>>
>>>
>>
>