Re: join in hive

Ning Zhang Mon, 26 Oct 2009 00:08:18 -0700

HIVE-870 is under implementation, and semijoin is designed to work on the 
map side as well as on the reduce side.


Rewriting a large (reduce-side) natural join into a semijoin followed by a 
map-side natural join may be better in some situations, but not always 
(remember that the semijoin here needs to be on the reduce side). The situation 
where a huge table joins a very small portion of another huge table seems best 
suited to an indexed join scheme. HIVE-417 is the project for supporting 
indexes in Hive.

Thanks,
Ning

On Oct 25, 2009, at 11:03 PM, Namit Jain wrote:

I wanted to add one more thing,

Ning has started working on semi-join. So, please co-ordinate with him on that.


http://issues.apache.org/jira/browse/HIVE-870





On 10/25/09 9:04 PM, "Zheng Shao" wrote:

Mostly correct.

2. Your idea looks interesting but I would say in reality, the percentage of 
tuples purged may not be that large.
4. Hive does NOT treat the partition column differently than others.
5. There is no sort-merge join yet. This would be a great feature to add onto 
Hive!

Zheng

2009/10/25 Gang Luo
Hi everyone,
I am planning to do some work on joins in Hive. Before I read the source code, 
could anyone tell me which kinds of join have been implemented in the newest 
version of Hive?

Here is what I know so far:
1. Symmetric join has been implemented and is the default join.
2. Asymmetric join, a.k.a. map-side join (for joining a huge table with a small 
table using only the map phase), has been implemented, but no optimization 
has been added. If so, my idea is that when we meet two huge tables, we can use 
a semi-join to first get rid of the non-referenced tuples in one table, making 
it smaller, and then do the map-side join.
3. 3-way join (using only one map-reduce job to join 3 tables) has been 
implemented, but it only applies when joining on the same join key (A.k=B.k && 
B.k=C.k). If we want to join 3 tables on different join keys (A.k1=B.k1 && 
B.k2=C.k2), we still need 2 map-reduce jobs.
4. When joining two tables, Hive can tell whether the join key is a partition 
column and make good use of the partitioning.
5. No sort-merge join has been implemented in Hive yet, so we cannot do 
non-equi joins.
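
The semi-join idea in item 2 can be sketched outside Hive in a few lines of 
Python. This is only an illustration of the technique, not Hive's 
implementation: the helper names (semijoin_reduce, map_side_join), the 
row-as-dict layout, and the toy tables a and b are all assumptions made for 
the example.

```python
from collections import defaultdict


def semijoin_reduce(rows, referenced_keys, key):
    """Semi-join step: keep only rows whose join key is referenced by the other table."""
    return [row for row in rows if row[key] in referenced_keys]


def map_side_join(stream_rows, small_rows, key):
    """Hash join: load small_rows into memory, then probe it with stream_rows."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)
    joined = []
    for left in stream_rows:
        for right in table.get(left[key], []):
            joined.append({**right, **left})
    return joined


# Hypothetical toy tables: a has 1000 rows, b references only a few of its keys.
a = [{"k": i, "a_val": f"a{i}"} for i in range(1000)]
b = [{"k": k, "b_val": f"b{k}"} for k in (3, 7, 7, 5000)]

b_keys = {row["k"] for row in b}            # distinct join keys of b
a_reduced = semijoin_reduce(a, b_keys, "k")  # a shrinks to the referenced rows
result = map_side_join(b, a_reduced, "k")    # reduced a now fits in memory
```

Whether this pays off depends on how many tuples the semi-join step actually 
removes: if most keys of a are referenced by b, a_reduced is barely smaller 
than a and the extra pass is wasted.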

There may be many mistakes in my understanding. Please point them out or give 
me further information about joins in Hive. Thanks so much.


Luo, Gang
---------
Department of Computer Science
Duke University
(919)316-0993

