Yes, 4 nodes is probably too small for this. The default value of bytes per 
reducer is 1GB (see hive.exec.reducers.bytes.per.reducer in hive-default.xml).
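On a small cluster you can lower that value (or pin the reducer count) for the 
session before running the query. A minimal sketch, with illustrative values:

  -- aim for ~256MB of input per reducer instead of the 1GB default,
  -- so the same input is spread over more reducers
  set hive.exec.reducers.bytes.per.reducer=256000000;
  -- or pin the number of reduce tasks directly
  set mapred.reduce.tasks=16;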


Ning

On Oct 26, 2009, at 10:56 AM, Ryan LeCompte wrote:

Ning,

Thanks for the explanation. I was able to take your example semijoin query and 
adapt it to my use case. Hive builds 3 map/reduce jobs from the query. The 
first one completes, but the second one still times out on the reducer. Perhaps 
my cluster is too small for this? It's only 4 nodes.


On Mon, Oct 26, 2009 at 1:47 PM, Ning Zhang <[email protected]> wrote:
This query is using the regular join to simulate semijoin. The proposed 
semijoin syntax is something like:

select A.id from A left semi join B on A.id = B.id;

Semantically, semijoin is the same as the simulated query using JOIN, but the 
proposed semijoin operation is more efficient in that:
1) the simulation has a group by, which requires a separate map-reduce task. In 
the semijoin, the group by is map-side only, so we can eliminate one map-reduce 
task.
2) the semijoin implements early-exit semantics: whenever a match is found in 
B, the rest of the matching tuples in B are skipped, while the JOIN keeps 
matching even after a match is found.
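As a toy illustration (hypothetical data): if A contains id 7 once and B 
contains id 7 three times, the simulated JOIN first produces three matched rows 
that the group by / distinct must then collapse, whereas

  select A.id from A left semi join B on A.id = B.id;

emits 7 exactly once and skips B's remaining matching tuples.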

Thanks,
Ning

On Oct 26, 2009, at 9:56 AM, Ryan LeCompte wrote:

I got the query working (my bad syntax caused it to error out)... However, can 
you please explain what Hive is doing in your example query for the semijoin?

Thanks,
Ryan

On Mon, Oct 26, 2009 at 12:52 PM, Ryan LeCompte <[email protected]> wrote:
Also, if I try your query I get:

FAILED: Parse Error: line 1:8 cannot recognize input 'DISTINCT' in expression 
specification


Seems like it doesn't like SELECT (DISTINCT ...)



On Mon, Oct 26, 2009 at 12:47 PM, Ryan LeCompte <[email protected]> wrote:
Ah, got it!

Can you explain your semijoin query though? I see that you are using JOIN but I 
don't see the ON (...) part. What does Hive do in this case when ON (..) 
doesn't exist?



On Mon, Oct 26, 2009 at 12:41 PM, Ning Zhang <[email protected]> wrote:
I see. If you just want all userids in one table that also appear in another 
table, semijoin would help (it is still under implementation though). You can 
simulate semijoin using JOIN as follows:

select (distinct ut.userid) from usertracking ut join (select usertrackingid 
from streamtransfers group by usertrackingid where usertrackingid is not null 
and usertrackingid<>0) where userid is not null and userid <> 0;
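
For reference, Hive rejects the parenthesized DISTINCT, expects the WHERE 
before the GROUP BY, and needs an alias and an ON clause for the subquery join. 
Assuming the intent is "userids present in both tables", a corrected form of 
the simulation would be:

  select distinct ut.userid
  from usertracking ut
  join (select usertrackingid
        from streamtransfers
        where usertrackingid is not null and usertrackingid <> 0
        group by usertrackingid) st
    on (ut.usertrackingid = st.usertrackingid)
  where ut.userid is not null and ut.userid <> 0;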

BTW, the rewritten query using UNION ALL is not semantically equivalent to the 
original query: it doesn't return the "common" user IDs in both tables, but 
just the "union" of them.

Ning

On Oct 26, 2009, at 9:33 AM, Ryan LeCompte wrote:

Hi Ning,

Basically, I have roughly 60GB of log data that is logically broken up into the 
two Hive tables I mentioned. Both tables share a common key, and this key 
appears in virtually every row of each table. Each table has just one partition 
(by date). My query is very similar (although with different data/columns) to 
Chris Bates' query:


SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN streamtransfers ST 
ON (ST.usertrackingid = UT.usertrackingid) WHERE UT.UserID IS NOT NULL AND 
UT.UserID <> 0;

However, in the above example imagine that in both tables the same users appear 
a lot, so there are lots of matches.



On Mon, Oct 26, 2009 at 12:27 PM, Ning Zhang <[email protected]> wrote:
If it is really a Cartesian product, there is no better way than increasing the 
timeout for the reducers. You can do a back-of-the-envelope calculation of how 
long it will take (e.g., your log shows it takes 19 seconds to get 6 million 
rows out of the join). The same calculation can be done when it is not a 
Cartesian product, as long as you have a good estimate of how many rows will 
be produced.
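
For example, with the numbers above: 6,000,000 rows / 19 s is roughly 316,000 
rows per second, so a join that has to emit on the order of 10^9 rows (the log 
excerpt below is already past 10^8 and still climbing) would keep a reducer 
busy for roughly 3,200 seconds, close to an hour. (The 10^9 figure is only an 
illustration.)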

In general, joining two huge tables can be avoided by partitioning the fact 
tables, so that you don't need to join the whole table.
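
For example (the ds partition column here is hypothetical), if both tables were 
partitioned by day, the join could be restricted to a single day's partitions:

  select count(distinct ut.userid)
  from usertracking ut join streamtransfers st
    on (st.usertrackingid = ut.usertrackingid)
  where ut.ds = '2009-10-25' and st.ds = '2009-10-25'
    and ut.userid is not null and ut.userid <> 0;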

BTW, are you joining two fact tables or one dimension table is just huge?

Thanks,
Ning

On Oct 26, 2009, at 5:16 AM, Ryan LeCompte wrote:

So it turns out that the JOIN key of my query basically results in a match/join 
on all rows of each table! There really is no extra filtering I can do to 
exclude invalid rows, etc. The mappers fly by and complete, but the reducers 
are moving extremely slowly (my guess is that this is due to what Zheng said 
about the Cartesian product of all the matching rows).

Is there some other way that I could re-write the JOIN, or is my only option to 
increase the timeout on the task trackers so that they don't kill the reducers? 
I've already upped their timeout to 30 minutes (as opposed to the default of 
10), and it doesn't seem to be sufficient... Again, this is joining a 33GB 
table with a 13GB table where the join key is shared by virtually all rows in 
both tables.
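
For reference, the knob in this Hadoop generation is mapred.task.timeout (in 
milliseconds); assuming your version honors the per-job override, it can be 
raised from the Hive session, e.g. the 30-minute value mentioned above:

  -- allow 30 minutes without a progress report before a task is killed
  -- (the Hadoop default is 600000 ms, i.e. 10 minutes)
  set mapred.task.timeout=1800000;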

Thanks,
Ryan


On Mon, Oct 26, 2009 at 7:35 AM, Ryan LeCompte <[email protected]> wrote:
Thanks guys, very useful information. I will modify my query a bit and get back 
to you guys on whether it worked or not.

Thanks,
Ryan



On Mon, Oct 26, 2009 at 4:34 AM, Chris Bates <[email protected]> wrote:
Ryan,

I asked this question a couple of days ago, but in a slightly different form.  
What you have to do is make sure the table you're joining is smaller than the 
leftmost table.  As an example,

SELECT COUNT(DISTINCT UT.UserID) FROM usertracking UT JOIN streamtransfers ST 
ON (ST.usertrackingid = UT.usertrackingid) WHERE UT.UserID IS NOT NULL AND 
UT.UserID <> 0;

In this query, usertracking is a table that is about 8 or 9 gigs, and 
streamtransfers is about 4 gigs.  As per Zheng's comment, I filtered out 
UserIDs that are NULL or zero, since there are many rows with those keys, and 
the join worked as intended.

Chris

PS. As an aside, Hive is proving to be quite useful to all of our database 
hackers here at Grooveshark.  Thanks to everyone who has committed...I hope to 
contribute soon.


On Mon, Oct 26, 2009 at 2:08 AM, Zheng Shao <[email protected]> wrote:
It's probably caused by the Cartesian product of many rows from the two tables 
with the same key.

Zheng


On Sun, Oct 25, 2009 at 7:22 PM, Ryan LeCompte <[email protected]> wrote:
It also looks like the reducers just never stop outputting lines like the 
following (see below), causing them to ultimately time out and get killed by 
the system.


2009-10-25 22:21:18,879 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 100000000 rows
2009-10-25 22:21:22,009 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 101000000 rows
2009-10-25 22:21:22,010 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 101000000 rows
2009-10-25 22:21:25,141 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 102000000 rows
2009-10-25 22:21:25,142 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 102000000 rows
2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 103000000 rows
2009-10-25 22:21:28,263 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 103000000 rows
2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 104000000 rows
2009-10-25 22:21:31,387 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 104000000 rows
2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 105000000 rows
2009-10-25 22:21:34,510 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 105000000 rows
2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 106000000 rows
2009-10-25 22:21:37,633 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 5 forwarding 106000000 rows

On Sun, Oct 25, 2009 at 9:39 PM, Ryan LeCompte <[email protected]> wrote:
Hello all,

Should I expect to be able to do a Hive JOIN between two tables that have about 
10 or 15GB of data each? What I'm noticing (for a simple JOIN) is that all the 
map tasks complete, but the reducers just hang at around 87% or so (for the 
first set of 4 reducers), and eventually get killed by the cluster because they 
stop responding. I can do a JOIN between a large table and a very small table 
of 10 or so records just fine.

Any thoughts?

Thanks,
Ryan





--
Yours,
Zheng