Ning,

Thanks for the explanation. I was able to take your example semijoin query
and adapt it to my use case. Hive builds 3 map/reduce jobs from the query.
The first one completes, but the second one still times out on the reducer.
Perhaps my cluster is too small for this? It's only 4 nodes.


On Mon, Oct 26, 2009 at 1:47 PM, Ning Zhang <[email protected]> wrote:

> This query is using the regular join to simulate semijoin. The proposed
> semijoin syntax is something like:
>
> select A.id from A left semi join B on A.id = B.id;
>
> Semantically, the semijoin is the same as the simulated query using JOIN, but
> the proposed semijoin operation is more efficient in that:
> 1) the simulation has a group by which requires a separate map-reduce task.
> In the semijoin, the group by is map-side only, so we can eliminate one
> map-reduce task.
> 2) the semijoin implements early-exit semantics: as soon as a match is found
> in B, the rest of the tuples in B are skipped, whereas the JOIN keeps
> matching even after a match is found.
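>
> For concreteness, a cleaned-up sketch of the simulation (table and column
> names taken from this thread; the subquery alias and the outer GROUP BY are
> just there to make the syntax valid) would look something like:
>
>   -- sketch only: dedupe with GROUP BY rather than DISTINCT
>   select ut.userid
>   from usertracking ut
>   join (select usertrackingid
>         from streamtransfers
>         where usertrackingid is not null and usertrackingid <> 0
>         group by usertrackingid) st
>     on (ut.usertrackingid = st.usertrackingid)
>   where ut.userid is not null and ut.userid <> 0
>   group by ut.userid;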
>
> Thanks,
> Ning
>
> On Oct 26, 2009, at 9:56 AM, Ryan LeCompte wrote:
>
> I got the query working (my bad syntax caused it to error out)... However,
> can you please explain what Hive is doing in your example query for the
> semijoin?
>
> Thanks,
> Ryan
>
> On Mon, Oct 26, 2009 at 12:52 PM, Ryan LeCompte <[email protected]> wrote:
>
>> Also, if I try your query I get:
>>
>> FAILED: Parse Error: line 1:8 cannot recognize input 'DISTINCT' in
>> expression specification
>>
>>
>> Seems like it doesn't like SELECT (DISTINCT ...)
>>
>>
>>
>> On Mon, Oct 26, 2009 at 12:47 PM, Ryan LeCompte <[email protected]> wrote:
>>
>>> Ah, got it!
>>>
>>> Can you explain your semijoin query though? I see that you are using JOIN
>>> but I don't see the ON (...) part. What does Hive do in this case when ON
>>> (..) doesn't exist?
>>>
>>>
>>>
>>> On Mon, Oct 26, 2009 at 12:41 PM, Ning Zhang <[email protected]> wrote:
>>>
>>>> I see. If you just want all userids in one table that also appear in
>>>> another table, a semijoin would help (it is still under implementation
>>>> though). You can simulate a semijoin using JOIN as follows:
>>>>
>>>> select (distinct ut.userid) from usertracking ut join (select
>>>> usertrackingid from streamtransfers group by usertrackingid where
>>>> usertrackingid is not null and usertrackingid<>0) where userid is not null
>>>> and userid <> 0;
>>>>
>>>> BTW, the rewritten query using UNION ALL is not semantically equivalent
>>>> to the original query: it doesn't return the user IDs "common" to both
>>>> tables, but just the "union" of them.
>>>>
>>>> Ning
>>>>
>>>> On Oct 26, 2009, at 9:33 AM, Ryan LeCompte wrote:
>>>>
>>>> Hi Ning,
>>>>
>>>> Basically, I have roughly 60GB of log data that is logically broken up
>>>> into two Hive tables, which works out to be those two tables that I had
>>>> mentioned. Both of these tables share a common key, but this key appears in
>>>> virtually every row of each of the two tables. Each of these tables just
>>>> has 1 partition (by date). My query is very similar (although with different
>>>> data/columns) to Chris Bates' query:
>>>>
>>>>
>>>> SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN
>>>> streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid) WHERE
>>>> UT.UserID IS NOT NULL AND UT.UserID <> 0;
>>>>
>>>> However, in the above example imagine that in both tables the same users
>>>> appear a lot, so there are lots of matches.
>>>>
>>>>
>>>>
>>>> On Mon, Oct 26, 2009 at 12:27 PM, Ning Zhang <[email protected]> wrote:
>>>>
>>>>> If it is really a Cartesian product, there is no better way other than
>>>>> increasing the timeout for the reducers. You can do a back-of-the-envelope
>>>>> calculation on how long it takes (e.g., your log shows it takes 19
>>>>> sec. to get 6 million rows out of the join). This calculation can also be
>>>>> done if it is not a Cartesian product and you have a good estimate of
>>>>> how
>>>>> many rows will be produced.
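>>>>>
>>>>> For example, at roughly 6 million rows per 19 sec. (about 300K rows/sec),
>>>>> a reducer that had to emit, say, 10 billion rows for its share of keys
>>>>> would run for roughly 9 hours, so raising the timeout alone may not get
>>>>> you very far. Assuming a stock Hadoop setup, the relevant knob is the
>>>>> standard task timeout, which you can raise for your session from Hive:
>>>>>
>>>>>   -- value is in milliseconds; the Hadoop default is 600000 (10 minutes)
>>>>>   set mapred.task.timeout=1800000;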
>>>>>
>>>>> In general, joining two huge tables can be avoided by partitioning the
>>>>> fact tables, so that you don't need to join the whole table.
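>>>>>
>>>>> For example, with a date-partitioned copy of the fact table (the dt
>>>>> partition column and the BIGINT type below are just placeholders), the
>>>>> join only has to read the partitions you actually care about:
>>>>>
>>>>>   -- illustrative schema; other columns omitted
>>>>>   CREATE TABLE streamtransfers_part (usertrackingid BIGINT)
>>>>>   PARTITIONED BY (dt STRING);
>>>>>
>>>>>   SELECT COUNT(DISTINCT ut.userid)
>>>>>   FROM usertracking ut JOIN streamtransfers_part st
>>>>>     ON (st.usertrackingid = ut.usertrackingid)
>>>>>   WHERE st.dt = '2009-10-25'
>>>>>     AND ut.userid IS NOT NULL AND ut.userid <> 0;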
>>>>>
>>>>> BTW, are you joining two fact tables, or is one dimension table just
>>>>> huge?
>>>>>
>>>>> Thanks,
>>>>> Ning
>>>>>
>>>>> On Oct 26, 2009, at 5:16 AM, Ryan LeCompte wrote:
>>>>>
>>>>> So it turns out that the JOIN key of my query basically results in a
>>>>> match/join on all rows of each table! There really is no extra filtering
>>>>> that I can do to exclude invalid rows, etc. The mappers fly by and 
>>>>> complete,
>>>>> but the reducers are just moving extremely slowly (my guess is this is due to what
>>>>> Zheng said about the Cartesian product of all rows getting matched).
>>>>>
>>>>> Is there some other way that I could re-write the JOIN or is my only
>>>>> option to increase the timeout on the task trackers so that they don't
>>>>> time out and kill the reducers? I've already upped their timeouts to 30 minutes
>>>>> (as opposed to the default of 10), and it doesn't seem to be sufficient...
>>>>> Again, this is joining a 33GB table with a 13GB table where the join key is
>>>>> shared by virtually all rows in both tables.
>>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>>
>>>>>
>>>>> On Mon, Oct 26, 2009 at 7:35 AM, Ryan LeCompte <[email protected]> wrote:
>>>>>
>>>>>> Thanks guys, very useful information. I will modify my query a bit and
>>>>>> get back to you guys on whether it worked or not.
>>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 26, 2009 at 4:34 AM, Chris Bates <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Ryan,
>>>>>>>
>>>>>>> I asked this question a couple days ago but in a slightly different
>>>>>>> form. What you have to do is make sure the table you're joining is
>>>>>>> smaller than the leftmost table. As an example,
>>>>>>>
>>>>>>> SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN
>>>>>>> streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid) WHERE
>>>>>>> UT.UserID IS NOT NULL AND UT.UserID <> 0;
>>>>>>>
>>>>>>> In this query, usertracking is a table that is about 8 or 9 gigs.
>>>>>>> Streamtransfers is a table that is about 4 gigs. As per Zheng's comment,
>>>>>>> I omitted UserIDs of NULL or zero, as there are many rows with those
>>>>>>> keys, and the join worked as intended.
>>>>>>>
>>>>>>> Chris
>>>>>>>
>>>>>>> PS. As an aside, Hive is proving to be quite useful to all of our
>>>>>>> database hackers here at Grooveshark.  Thanks to everyone who has
>>>>>>> committed...I hope to contribute soon.
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Oct 26, 2009 at 2:08 AM, Zheng Shao <[email protected]>wrote:
>>>>>>>
>>>>>>>> It's probably caused by the Cartesian product of many rows from the
>>>>>>>> two tables with the same key.
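>>>>>>>>
>>>>>>>> For example, if a single userid appeared 10,000 times in each table (a
>>>>>>>> made-up figure), the join would emit 10,000 x 10,000 = 100 million rows
>>>>>>>> for that one key alone, which is the scale of the row counts in your
>>>>>>>> reducer log.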
>>>>>>>>
>>>>>>>> Zheng
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Oct 25, 2009 at 7:22 PM, Ryan LeCompte <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> It also looks like the reducers just never stop outputting things
>>>>>>>>> like the following (see below), causing them to ultimately time out
>>>>>>>>> and get killed by the system.
>>>>>>>>>
>>>>>>>>> 2009-10-25 22:21:18,879 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100000000 rows
>>>>>>>>> 2009-10-25 22:21:22,009 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 101000000 rows
>>>>>>>>> 2009-10-25 22:21:22,010 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 101000000 rows
>>>>>>>>> 2009-10-25 22:21:25,141 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 102000000 rows
>>>>>>>>> 2009-10-25 22:21:25,142 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 102000000 rows
>>>>>>>>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 103000000 rows
>>>>>>>>> 2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 103000000 rows
>>>>>>>>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104000000 rows
>>>>>>>>> 2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 104000000 rows
>>>>>>>>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105000000 rows
>>>>>>>>> 2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 105000000 rows
>>>>>>>>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106000000 rows
>>>>>>>>> 2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 106000000 rows
>>>>>>>>>
>>>>>>>>> On Sun, Oct 25, 2009 at 9:39 PM, Ryan LeCompte <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello all,
>>>>>>>>>>
>>>>>>>>>> Should I expect to be able to do a Hive JOIN between two tables
>>>>>>>>>> that have about 10 or 15GB of data each? What I'm noticing (for a
>>>>>>>>>> simple JOIN) is that all the map tasks complete, but the reducers
>>>>>>>>>> just hang at around 87% or so (for the first set of 4 reducers), and
>>>>>>>>>> then they eventually get killed by the cluster for failing to
>>>>>>>>>> respond. I can do a JOIN between a large table and a very small
>>>>>>>>>> table of 10 or so records just fine.
>>>>>>>>>>
>>>>>>>>>> Any thoughts?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Yours,
>>>>>>>> Zheng
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
>
