Ah, got it! Can you explain your semijoin query, though? I see that you are using JOIN, but I don't see the ON (...) clause. What does Hive do when the ON (...) clause is missing?
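Just so I'm reading you right, here is the shape I expected, with an explicit ON clause and an alias on the subquery. This is only a sketch of my reading of your query (I'm assuming COUNT(DISTINCT ...) was intended, as in the original query, and I moved the WHERE before the GROUP BY), using the table and column names from this thread:

-- Sketch of how I read the simulated semijoin. The inner GROUP BY
-- deduplicates usertrackingid, so each usertracking row matches at
-- most one row from the streamtransfers side.
SELECT COUNT(DISTINCT ut.userid)
FROM usertracking ut
JOIN (
  SELECT usertrackingid
  FROM streamtransfers
  WHERE usertrackingid IS NOT NULL AND usertrackingid <> 0
  GROUP BY usertrackingid
) st ON (st.usertrackingid = ut.usertrackingid)
WHERE ut.userid IS NOT NULL AND ut.userid <> 0;

If that matches your intent, the only differences from what you wrote are the subquery alias, the ON clause, and the WHERE/GROUP BY order, which is why I'm curious what Hive does when the ON clause is left out.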
On Mon, Oct 26, 2009 at 12:41 PM, Ning Zhang <[email protected]> wrote:

> I see. If you just want all userids in one table that also appear in
> another table, a semijoin would help (it is still under implementation
> though). You can simulate a semijoin using JOIN as follows:
>
> select (distinct ut.userid) from usertracking ut join (select
> usertrackingid from streamtransfers group by usertrackingid where
> usertrackingid is not null and usertrackingid<>0) where userid is not null
> and userid <> 0;
>
> BTW, the rewritten query using UNION ALL is not semantically equivalent
> to the original query: it doesn't return the "common" user IDs in both
> tables, just the "union" of them.
>
> Ning
>
> On Oct 26, 2009, at 9:33 AM, Ryan LeCompte wrote:
>
> Hi Ning,
>
> Basically, I have roughly 60GB of log data that is logically broken up
> into two Hive tables, which works out to be the two tables I mentioned.
> Both tables share a common key, but this key appears in virtually every
> row of each table. Each table has just one partition (by date). My query
> is very similar (although with different data/columns) to Chris Bates'
> query:
>
> SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN streamtransfers
> ST ON (ST.usertrackingid = UT.usertrackingid) WHERE UT.UserID IS NOT NULL
> AND UT.UserID <> 0;
>
> However, in the above example, imagine that the same users appear many
> times in both tables, so there are lots of matches.
>
> On Mon, Oct 26, 2009 at 12:27 PM, Ning Zhang <[email protected]> wrote:
>
>> If it is really a Cartesian product, there is no better way other than
>> increasing the timeout for the reducers. You can do a back-of-the-envelope
>> calculation of how long it will take (e.g., your log shows it takes 19
>> sec. to get 6 million rows out of the join). This calculation can also
>> be done if it is not a Cartesian product, provided you have a good
>> estimate of how many rows will be produced.
>>
>> In general, joining two huge tables can be avoided by partitioning the
>> fact tables, so that you don't need to join the whole table.
>>
>> BTW, are you joining two fact tables, or is one dimension table just
>> huge?
>>
>> Thanks,
>> Ning
>>
>> On Oct 26, 2009, at 5:16 AM, Ryan LeCompte wrote:
>>
>> So it turns out that the JOIN key of my query basically results in a
>> match/join on all rows of each table! There really is no extra filtering
>> I can do to exclude invalid rows, etc. The mappers fly by and complete,
>> but the reducers move extremely slowly (my guess: due to what Zheng said
>> about the Cartesian product of all the matching rows).
>>
>> Is there some other way I could rewrite the JOIN, or is my only option
>> to increase the timeout on the task trackers so that they don't time out
>> and kill the reducers? I've already upped their timeouts to 30 minutes
>> (as opposed to the default of 10), and it doesn't seem to be
>> sufficient... Again, this is joining a 33GB table with a 13GB table
>> where the join key is shared by virtually all rows in both tables.
>>
>> Thanks,
>> Ryan
>>
>> On Mon, Oct 26, 2009 at 7:35 AM, Ryan LeCompte <[email protected]> wrote:
>>
>>> Thanks guys, very useful information. I will modify my query a bit and
>>> get back to you on whether it worked or not.
>>>
>>> Thanks,
>>> Ryan
>>>
>>> On Mon, Oct 26, 2009 at 4:34 AM, Chris Bates
>>> <[email protected]> wrote:
>>>
>>>> Ryan,
>>>>
>>>> I asked this question a couple of days ago, but in a slightly
>>>> different form. What you have to do is make sure the table you're
>>>> joining is smaller than the leftmost table. As an example:
>>>>
>>>> SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN
>>>> streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid) WHERE
>>>> UT.UserID IS NOT NULL AND UT.UserID <> 0;
>>>>
>>>> In this query, usertracking is a table that is about 8 or 9 gigs, and
>>>> streamtransfers is a table that is about 4 gigs. Per Zheng's comment,
>>>> I omitted UserIDs of NULL or zero, since there are many rows with
>>>> those keys, and the join worked as intended.
>>>>
>>>> Chris
>>>>
>>>> PS. As an aside, Hive is proving to be quite useful to all of our
>>>> database hackers here at Grooveshark. Thanks to everyone who has
>>>> committed... I hope to contribute soon.
>>>>
>>>> On Mon, Oct 26, 2009 at 2:08 AM, Zheng Shao <[email protected]> wrote:
>>>>
>>>>> It's probably caused by the Cartesian product of the many rows from
>>>>> the two tables that share the same key.
>>>>>
>>>>> Zheng
>>>>>
>>>>> On Sun, Oct 25, 2009 at 7:22 PM, Ryan LeCompte <[email protected]> wrote:
>>>>>
>>>>>> It also looks like the reducers just never stop outputting lines
>>>>>> like the following, causing them to ultimately time out and get
>>>>>> killed by the system:
>>>>>>
>>>>>> 2009-10-25 22:21:18,879 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100000000 rows
>>>>>> 2009-10-25 22:21:22,009 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 101000000 rows
>>>>>> 2009-10-25 22:21:22,010 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 101000000 rows
>>>>>> 2009-10-25 22:21:25,141 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 102000000 rows
>>>>>> 2009-10-25 22:21:25,142 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 102000000 rows
>>>>>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 103000000 rows
>>>>>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 103000000 rows
>>>>>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104000000 rows
>>>>>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 104000000 rows
>>>>>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105000000 rows
>>>>>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 105000000 rows
>>>>>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106000000 rows
>>>>>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 106000000 rows
>>>>>>
>>>>>> On Sun, Oct 25, 2009 at 9:39 PM, Ryan LeCompte <[email protected]> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> Should I expect to be able to do a Hive JOIN between two tables
>>>>>>> that have about 10 or 15GB of data each? What I'm noticing (for a
>>>>>>> simple JOIN) is that all the map tasks complete, but the reducers
>>>>>>> just hang at around 87% or so (for the first set of 4 reducers),
>>>>>>> and then they eventually get killed by the cluster for failing to
>>>>>>> respond. I can do a JOIN between a large table and a very small
>>>>>>> table of 10 or so records just fine.
>>>>>>>
>>>>>>> Any thoughts?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ryan
>>>>>
>>>>> --
>>>>> Yours,
>>>>> Zheng
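PS, for anyone who finds this thread in the archives: the direction I'm planning to try, a sketch only and untested on our schema, combines Ning's GROUP BY idea with Zheng's point about duplicate keys. Collapse each side to one row per join key before joining, so the reducers emit one row per key instead of the full cross product of duplicates:

-- Sketch (untested): pre-aggregate both sides so the join sees each
-- usertrackingid at most once per side, avoiding the per-key
-- Cartesian product in the reducers.
SELECT COUNT(DISTINCT ut.userid)
FROM (
  SELECT userid, usertrackingid
  FROM usertracking
  WHERE userid IS NOT NULL AND userid <> 0
  GROUP BY userid, usertrackingid
) ut
JOIN (
  SELECT usertrackingid
  FROM streamtransfers
  WHERE usertrackingid IS NOT NULL AND usertrackingid <> 0
  GROUP BY usertrackingid
) st ON (st.usertrackingid = ut.usertrackingid);

If I have it right, this should be semantically equivalent to the original COUNT(DISTINCT) query, since duplicate (userid, usertrackingid) pairs can't change the distinct count, and it also keeps the smaller, deduplicated table on the right, per Chris's note about join ordering.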
