Also, if I try your query I get:

    FAILED: Parse Error: line 1:8 cannot recognize input 'DISTINCT' in expression specification

Seems like it doesn't like SELECT (DISTINCT ...).
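A version of the simulated semijoin quoted below that should get past the parser (DISTINCT moved out of the parentheses, the subquery given an alias, WHERE placed before GROUP BY, and the missing ON clause added; a sketch using this thread's table and column names, not a tested query):

    SELECT DISTINCT ut.userid
    FROM usertracking ut
    JOIN (
      SELECT usertrackingid
      FROM streamtransfers
      WHERE usertrackingid IS NOT NULL AND usertrackingid <> 0
      GROUP BY usertrackingid
    ) st
    ON (ut.usertrackingid = st.usertrackingid)
    WHERE ut.userid IS NOT NULL AND ut.userid <> 0;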
On Mon, Oct 26, 2009 at 12:47 PM, Ryan LeCompte <[email protected]> wrote:

Ah, got it!

Can you explain your semijoin query though? I see that you are using JOIN,
but I don't see the ON (...) part. What does Hive do in this case when the
ON (...) doesn't exist?


On Mon, Oct 26, 2009 at 12:41 PM, Ning Zhang <[email protected]> wrote:

I see. If you just want all userids in one table that also appear in
another table, a semijoin would help (it is still under implementation,
though). You can simulate a semijoin using JOIN as follows:

    select (distinct ut.userid) from usertracking ut join (select
    usertrackingid from streamtransfers group by usertrackingid where
    usertrackingid is not null and usertrackingid <> 0) where userid
    is not null and userid <> 0;

BTW, the rewritten query using UNION ALL is not semantically equivalent to
the original query: it doesn't return the "common" user IDs in both
tables, but just the "union" of them.

Ning


On Oct 26, 2009, at 9:33 AM, Ryan LeCompte wrote:

Hi Ning,

Basically, I have roughly 60GB of log data that is logically broken up
into two Hive tables, which works out to be those two tables that I had
mentioned. Both of these tables share a common key, but this key appears
in virtually every row of each of the two tables. Each of these tables
has just one partition (by date). My query is very similar (although with
different data/columns) to Chris Bates' query:

    SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN
    streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid)
    WHERE UT.UserID IS NOT NULL AND UT.UserID <> 0;

However, in the above example, imagine that the same users appear a lot
in both tables, so there are lots of matches.


On Mon, Oct 26, 2009 at 12:27 PM, Ning Zhang <[email protected]> wrote:

If it is really a Cartesian product, there is no better way other than
increasing the timeout for the reducers. You can do a back-of-the-envelope
calculation of how long it will take (e.g., your log shows it takes 19
sec. to get 6 million rows out of the join). This calculation can also be
done if it is not a Cartesian product and you have a good estimate of how
many rows will be produced.

In general, joining two huge tables can be avoided by partitioning the
fact tables, so that you don't need to join the whole table.

BTW, are you joining two fact tables, or is one dimension table just huge?

Thanks,
Ning


On Oct 26, 2009, at 5:16 AM, Ryan LeCompte wrote:

So it turns out that the JOIN key of my query basically results in a
match/join on all rows of each table! There really is no extra filtering
that I can do to exclude invalid rows, etc. The mappers fly by and
complete, but the reducers are moving extremely slowly (my guess, due to
what Zheng said about the Cartesian product of all rows getting matched).

Is there some other way that I could rewrite the JOIN, or is my only
option to increase the timeout on the task trackers so that they don't
time out and kill the reducers? I've already upped their timeouts to 30
minutes (as opposed to the default of 10), and it doesn't seem to be
sufficient... Again, this is joining a 33GB table with a 13GB table where
the join key is shared by virtually all rows in both tables.

Thanks,
Ryan
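The back-of-the-envelope row estimate Ning describes can be done in Hive itself. The sketch below (using this thread's table and column names; COUNT(1) rather than COUNT(*), which is the idiom in Hive of this vintage) counts how often each join key occurs in each table and multiplies the two counts; summing joined_rows over all keys gives the total number of rows the join will emit:

    SELECT a.usertrackingid, a.cnt * b.cnt AS joined_rows
    FROM (SELECT usertrackingid, COUNT(1) AS cnt
          FROM usertracking
          GROUP BY usertrackingid) a
    JOIN (SELECT usertrackingid, COUNT(1) AS cnt
          FROM streamtransfers
          GROUP BY usertrackingid) b
    ON (a.usertrackingid = b.usertrackingid);

If a single placeholder key (such as 0) has a large cnt in both tables, its product dominates the output, which is exactly the Cartesian-product blowup described in this thread.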
On Mon, Oct 26, 2009 at 7:35 AM, Ryan LeCompte <[email protected]> wrote:

Thanks guys, very useful information. I will modify my query a bit and
get back to you on whether it worked or not.

Thanks,
Ryan


On Mon, Oct 26, 2009 at 4:34 AM, Chris Bates <[email protected]> wrote:

Ryan,

I asked this question a couple of days ago, but in a slightly different
form. What you have to do is make sure the table you're joining is smaller
than the leftmost table. As an example:

    SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN
    streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid)
    WHERE UT.UserID IS NOT NULL AND UT.UserID <> 0;

In this query, usertracking is a table that is about 8 or 9 gigs, and
streamtransfers is a table that is about 4 gigs. As per Zheng's comment,
I omitted UserIDs of NULL or zero, as there are many rows with those
keys, and the join worked as intended.

Chris

PS. As an aside, Hive is proving to be quite useful to all of our
database hackers here at Grooveshark. Thanks to everyone who has
committed... I hope to contribute soon.


On Mon, Oct 26, 2009 at 2:08 AM, Zheng Shao <[email protected]> wrote:

It's probably caused by the Cartesian product of many rows from the two
tables with the same key.

Zheng


On Sun, Oct 25, 2009 at 7:22 PM, Ryan LeCompte <[email protected]> wrote:

It also looks like the reducers just never stop outputting things like
the following, causing them to ultimately time out and get killed by the
system:
    2009-10-25 22:21:18,879 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100000000 rows
    2009-10-25 22:21:22,009 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 101000000 rows
    2009-10-25 22:21:22,010 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 101000000 rows
    2009-10-25 22:21:25,141 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 102000000 rows
    2009-10-25 22:21:25,142 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 102000000 rows
    2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 103000000 rows
    2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 103000000 rows
    2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104000000 rows
    2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 104000000 rows
    2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105000000 rows
    2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 105000000 rows
    2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106000000 rows
    2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 106000000 rows


On Sun, Oct 25, 2009 at 9:39 PM, Ryan LeCompte <[email protected]> wrote:

Hello all,

Should I expect to be able to do a Hive JOIN between two tables that have
about 10 or 15GB of data each? What I'm noticing (for a simple JOIN) is
that all the map tasks complete, but the reducers just hang at around 87%
or so (for the first set of 4 reducers), and then they eventually get
killed by the cluster for failing to respond. I can do a JOIN between a
large table and a very small table of 10 or so records just fine.

Any thoughts?

Thanks,
Ryan
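On the reducer-timeout question discussed above: the knob that governs how long a task may go without reporting progress is Hadoop's mapred.task.timeout, in milliseconds (its default of 600000 matches the 10 minutes mentioned in the thread). Since Hive's set command passes properties through to the job configuration, it can be raised per session before running the join. A sketch, not a tested recipe:

    -- allow reducers 60 minutes without progress before the tracker kills them
    set mapred.task.timeout=3600000;

    -- then run the join in the same session
    SELECT COUNT(DISTINCT UT.UserID)
    FROM usertracking UT
    JOIN streamtransfers ST ON (ST.usertrackingid = UT.usertrackingid)
    WHERE UT.UserID IS NOT NULL AND UT.UserID <> 0;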
