I got the query working (my bad syntax caused it to error out)... However,
can you please explain what Hive is doing in your example query for the
semijoin?
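
For the record, after fixing my syntax the version that parses for me looks
roughly like this (same tables/columns as below; I gave the subquery an alias,
moved the WHERE ahead of the GROUP BY, added an ON clause, and used GROUP BY
instead of DISTINCT -- I still need to double-check the results):

SELECT ut.userid
FROM usertracking ut
JOIN (
  SELECT usertrackingid
  FROM streamtransfers
  WHERE usertrackingid IS NOT NULL AND usertrackingid <> 0
  GROUP BY usertrackingid
) st ON (st.usertrackingid = ut.usertrackingid)
WHERE ut.userid IS NOT NULL AND ut.userid <> 0
GROUP BY ut.userid;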

Thanks,
Ryan

On Mon, Oct 26, 2009 at 12:52 PM, Ryan LeCompte <[email protected]> wrote:

> Also, if I try your query I get:
>
> FAILED: Parse Error: line 1:8 cannot recognize input 'DISTINCT' in
> expression specification
>
>
> Seems like it doesn't like SELECT (DISTINCT ...)
>
>
>
> On Mon, Oct 26, 2009 at 12:47 PM, Ryan LeCompte <[email protected]> wrote:
>
>> Ah, got it!
>>
>> Can you explain your semijoin query though? I see that you are using JOIN
>> but I don't see the ON (...) part. What does Hive do in this case when ON
>> (..) doesn't exist?
>>
>>
>>
>> On Mon, Oct 26, 2009 at 12:41 PM, Ning Zhang <[email protected]> wrote:
>>
>>> I see. If you just want all the userids in one table that also appear in
>>> another table, a semijoin would help (it is still under implementation
>>> though). You can simulate a semijoin using JOIN as follows:
>>>
>>> select (distinct ut.userid) from usertracking ut join (select
>>> usertrackingid from streamtransfers group by usertrackingid where
>>> usertrackingid is not null and usertrackingid<>0) where userid is not null
>>> and userid <> 0;
>>>
>>> BTW, the rewritten query using UNION ALL is not semantically equivalent
>>> to the original query: it doesn't return the "common" user IDs that appear
>>> in both tables, but just the "union" of them (a userid that shows up in only
>>> one of the two tables would still be included).
>>>
>>> Ning
>>>
>>> On Oct 26, 2009, at 9:33 AM, Ryan LeCompte wrote:
>>>
>>> Hi Ning,
>>>
>>> Basically, I have roughly 60GB of log data that is logically broken up
>>> into two Hive tables, which works out to be those two tables that I had
>>> mentioned. Both of these tables share a common key, but this key appears in
>>> virtually every row of each of the two tables. Each of these tables just has
>>> 1 partition (by date). My query is very similar (although with different
>>> data/columns) to Chris Bates' query:
>>>
>>>
>>> SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN
>>> streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid) WHERE
>>> UT.UserID IS NOT NULL AND UT.UserID <> 0;
>>>
>>> However, in the above example imagine that in both tables the same users
>>> appear a lot, so there are lots of matches.
>>>
>>>
>>>
>>> On Mon, Oct 26, 2009 at 12:27 PM, Ning Zhang <[email protected]> wrote:
>>>
>>>> If it is really a Cartesian product, there is no better way other than
>>>> increasing the timeout for the reducers. You can do a back-of-the-envelope
>>>> calculation on how long it takes (e.g., in your log it shows it takes 19
>>>> sec. to get 6 million rows out of the join). This calculation can also be
>>>> done if it is not a Cartesian product, and you have a good estimate of how
>>>> many rows will be produced.
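>>>>
>>>> (For example, at roughly 6 million join output rows per 19 seconds, i.e. on
>>>> the order of 300K rows/sec, a reducer that has to emit 1 billion rows would
>>>> need somewhere around 3,000 seconds, close to an hour.)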
>>>>
>>>> In general, joining two huge tables can be avoided by partitioning the
>>>> fact tables, so that you don't need to join the whole table.
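>>>>
>>>> For example, if both tables are partitioned by date (I'm assuming a partition
>>>> column named dt here; adjust to your actual schema), you can restrict the
>>>> join to a single day so that only that day's partitions are scanned:
>>>>
>>>> SELECT ut.userid
>>>> FROM usertracking ut JOIN streamtransfers st
>>>>   ON (st.usertrackingid = ut.usertrackingid)
>>>> WHERE ut.dt = '2009-10-25' AND st.dt = '2009-10-25'
>>>>   AND ut.userid IS NOT NULL AND ut.userid <> 0
>>>> GROUP BY ut.userid;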
>>>>
>>>> BTW, are you joining two fact tables, or is one of your dimension tables
>>>> just huge?
>>>>
>>>> Thanks,
>>>> Ning
>>>>
>>>> On Oct 26, 2009, at 5:16 AM, Ryan LeCompte wrote:
>>>>
>>>> So it turns out that the JOIN key of my query basically results in a
>>>> match/join on all rows of each table! There really is no extra filtering
>>>> that I can do to exclude invalid rows, etc. The mappers fly by and 
>>>> complete,
>>>> but the reducers are just moving extremely slowly (my guess is that this is
>>>> due to what Zheng said about the Cartesian product of all rows getting matched).
>>>>
>>>> Is there some other way that I could re-write the JOIN or is my only
>>>> option to increase the timeout on the task trackers so that they don't
>>>> timeout/kill the reducers? I've already upped their timeouts to 30 minutes
>>>> (as opposed to the default of 10), and it doesn't seem to be sufficient...
>>>> Again, this is joining a 33GB table with a 13GB table where the join key is
>>>> shared by virtually all rows in both tables.
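>>>>
>>>> (In case it matters, I believe the relevant setting is mapred.task.timeout,
>>>> which defaults to 600000 ms; I've been overriding it with something like:
>>>>
>>>> set mapred.task.timeout=1800000;
>>>>
>>>> before running the query.)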
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>>> On Mon, Oct 26, 2009 at 7:35 AM, Ryan LeCompte <[email protected]> wrote:
>>>>
>>>>> Thanks guys, very useful information. I will modify my query a bit and
>>>>> get back to you guys on whether it worked or not.
>>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Oct 26, 2009 at 4:34 AM, Chris Bates <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Ryan,
>>>>>>
>>>>>> I asked this question a couple days ago but in a slightly different
>>>>>> form.  What you have to do is make sure the table you're joining is 
>>>>>> smaller
>>>>>> than the leftmost table.  As an example,
>>>>>>
>>>>>> SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN
>>>>>> streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid) WHERE
>>>>>> UT.UserID IS NOT NULL AND UT.UserID <> 0;
>>>>>>
>>>>>> In this query, usertracking is a table that is about 8 or 9 gigs.
>>>>>>  Streamtransfers is a table that is about 4 gigs.  As per Zheng's 
>>>>>> comment, I
>>>>>> omitted UserIDs that are NULL or zero, as there are many rows with those keys, and
>>>>>> the join worked as intended.
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>> PS. As an aside, Hive is proving to be quite useful to all of our
>>>>>> database hackers here at Grooveshark.  Thanks to everyone who has
>>>>>> committed...I hope to contribute soon.
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 26, 2009 at 2:08 AM, Zheng Shao <[email protected]> wrote:
>>>>>>
>>>>>>> It's probably caused by the Cartesian product of many rows from the
>>>>>>> two tables with the same key.
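>>>>>>> For example, if a single usertrackingid value appears 10,000 times in each
>>>>>>> table, the join has to emit 10,000 x 10,000 = 100,000,000 rows for that one
>>>>>>> key alone.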
>>>>>>>
>>>>>>> Zheng
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Oct 25, 2009 at 7:22 PM, Ryan LeCompte <[email protected]> wrote:
>>>>>>>
>>>>>>>> It also looks like the reducers just never stop outputting lines like the
>>>>>>>> following, causing them to ultimately time out and get killed by the system.
>>>>>>>>
>>>>>>>> 2009-10-25 22:21:18,879 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100000000 rows
>>>>>>>> 2009-10-25 22:21:22,009 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 101000000 rows
>>>>>>>> 2009-10-25 22:21:22,010 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 101000000 rows
>>>>>>>> 2009-10-25 22:21:25,141 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 102000000 rows
>>>>>>>> 2009-10-25 22:21:25,142 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 102000000 rows
>>>>>>>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 103000000 rows
>>>>>>>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 103000000 rows
>>>>>>>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104000000 rows
>>>>>>>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 104000000 rows
>>>>>>>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105000000 rows
>>>>>>>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 105000000 rows
>>>>>>>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106000000 rows
>>>>>>>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 106000000 rows
>>>>>>>>
>>>>>>>> On Sun, Oct 25, 2009 at 9:39 PM, Ryan LeCompte <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello all,
>>>>>>>>>
>>>>>>>>> Should I expect to be able to do a Hive JOIN between two tables
>>>>>>>>> that have about 10 or 15GB of data each? What I'm noticing (for a 
>>>>>>>>> simple
>>>>>>>>> JOIN) is that all the map tasks complete, but the reducers just hang 
>>>>>>>>> at
>>>>>>>>> around 87% or so (for the first set of 4 reducers), and then they eventually
>>>>>>>>> get killed by the cluster for failing to respond. I can do a JOIN between a
>>>>>>>>> large table and a very small table of 10 or so records just fine.
>>>>>>>>>
>>>>>>>>> Any thoughts?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ryan
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Yours,
>>>>>>> Zheng
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>
