Re: join in hive

Gang Luo Mon, 26 Oct 2009 05:50:08 -0700

Thank you, Zheng and Ning.

 Luo, Gang
---------
Department of Computer Science
Duke University
(919)316-0993
[email protected]







________________________________
From: Ning Zhang <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Mon, Oct 26, 2009 3:07:49 AM
Subject: Re: join in hive

HIVE-870 is currently being implemented, and semijoin is designed to work on 
the map side as well as on the reduce side. 

The idea of rewriting a large (reduce-side) natural join as a semijoin followed 
by a map-side natural join may be better in some situations, but not always 
(remember that the semijoin here has to run on the reduce side). The case where 
a huge table joins a very small portion of another huge table seems best suited 
to an indexed join scheme. HIVE-417 is the project for supporting indexes in Hive. 
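
The rewrite discussed above can be illustrated with a minimal Python sketch. 
This is not Hive's implementation: the function names (semijoin, 
map_side_join) and the toy tables are assumptions made up for illustration. 
Step 1 prunes one table to the rows whose key occurs in the other; step 2 
hashes the pruned table in memory and streams the other table past it, which 
is roughly what a map-side join does. Whether step 1 pays off depends on how 
many tuples it actually removes.

```python
from collections import defaultdict


def semijoin(rows, other_keys, key):
    """Keep only the rows whose join key appears in other_keys."""
    return [row for row in rows if row[key] in other_keys]


def map_side_join(big, small, key):
    """Hash join: build an in-memory table on `small`, stream `big` past it."""
    table = defaultdict(list)
    for row in small:
        table[row[key]].append(row)
    out = []
    for b in big:
        # .get avoids creating empty entries for keys with no match
        for s in table.get(b[key], []):
            out.append({**s, **b})
    return out


# Hypothetical toy data standing in for two "huge" tables.
orders = [{"k": 1, "amount": 10}, {"k": 2, "amount": 20}, {"k": 2, "amount": 5}]
users = [{"k": 2, "name": "ann"}, {"k": 3, "name": "bob"}, {"k": 4, "name": "cy"}]

# Step 1: drop the users that no order references.
order_keys = {row["k"] for row in orders}
reduced_users = semijoin(users, order_keys, "k")

# Step 2: the pruned table is now small enough to hash in memory.
joined = map_side_join(orders, reduced_users, "k")
```

In this toy case the semijoin shrinks users from three rows to one. If most 
keys were referenced, step 1 would cost an extra pass and save very little.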

Thanks,
Ning


On Oct 25, 2009, at 11:03 PM, Namit Jain wrote:

> I wanted to add one more thing,
>
> Ning has started working on semi-join. So, please co-ordinate with him on 
> that.
>
> http://issues.apache.org/jira/browse/HIVE-870
>
> On 10/25/09 9:04 PM, "Zheng Shao" <[email protected]> wrote:
>
>> Mostly correct.
>>
>> 2. Your idea looks interesting but I would say in reality, the percentage 
>> of tuples purged may not be that large.
>> 4. Hive does NOT treat the partition column differently than others.
>> 5. There is no sort-merge join yet. This would be a great feature to add 
>> onto Hive!
>>
>> Zheng
>>
>>
>> 2009/10/25 Gang Luo <[email protected]>
>>
>>> Hi everyone,
>>> I am going to do some interesting things with joins in Hive. Before I read 
>>> the source code, could anyone tell me what kinds of join have been 
>>> implemented in the newest version of Hive?
>>>
>>> Right now, what I know is:
>>> 1. Symmetric join has been implemented, and it is the default join.
>>> 2. Asymmetric join, a.k.a. map-side join (for joining a huge table with a 
>>> small table using only the map phase), has been implemented, but no 
>>> optimization has been added. If so, I think that when we meet two huge 
>>> tables, we can use a semi-join to first get rid of the non-referenced 
>>> tuples in one table, making it smaller, and then do the map-side join.
>>> 3. 3-way join (using only one map-reduce job to join 3 tables) has been 
>>> implemented, but it only applies when joining on the same join key 
>>> (A.k=B.k && B.k=C.k). If we want to join 3 tables on different join keys 
>>> (A.k1=B.k1 && B.k2=C.k2), we still need 2 map-reduce jobs.
>>> 4. When joining two tables, Hive can tell whether the join key is a 
>>> partition column, and make good use of this partitioning.
>>> 5. No sort-merge join has been implemented in Hive yet, thus we cannot 
>>> do non-equi joins.
>>>
>>> There may be many mistakes in my understanding. Please point them out or 
>>> give me further information about joins in Hive. Thanks so much.
>>>
>>> Luo, Gang
>>> ---------
>>> Department of Computer Science
>>> Duke University
>>> (919)316-0993
>>> [email protected]
>>>
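
Point 3 of the original question quoted above (a 3-way join on a shared key 
fits in one map-reduce job) can be illustrated with a small Python sketch. It 
only simulates a shuffle in memory and is not how Hive implements it; the 
names shuffle and one_job_three_way, the "v" payload column, and the toy rows 
are all assumptions for illustration. Because A, B and C are all grouped on 
the same key k, a single grouping pass brings every matching row together. 
With different keys (A.k1=B.k1 and B.k2=C.k2), the A-B result would have to be 
regrouped on k2, which is why a second job is needed.

```python
from collections import defaultdict
from itertools import product


def shuffle(tagged_rows, key_of):
    """Group (tag, row) pairs by join key, as one map-reduce shuffle would."""
    groups = defaultdict(lambda: defaultdict(list))
    for tag, row in tagged_rows:
        groups[key_of(row)][tag].append(row)
    return groups


def one_job_three_way(a, b, c):
    """Inner join on A.k = B.k AND B.k = C.k using a single grouping pass."""
    tagged = (
        [("A", r) for r in a] + [("B", r) for r in b] + [("C", r) for r in c]
    )
    groups = shuffle(tagged, lambda row: row["k"])
    out = []
    for k, by_tag in groups.items():
        # The product is empty if any of the three tables lacks this key.
        for ra, rb, rc in product(by_tag["A"], by_tag["B"], by_tag["C"]):
            out.append((k, ra["v"], rb["v"], rc["v"]))
    return sorted(out)


# Hypothetical toy tables.
a = [{"k": 1, "v": "a1"}, {"k": 2, "v": "a2"}]
b = [{"k": 1, "v": "b1"}, {"k": 2, "v": "b2"}, {"k": 3, "v": "b3"}]
c = [{"k": 1, "v": "c1"}]

result = one_job_three_way(a, b, c)
```

Only key 1 occurs in all three tables, so result holds a single joined tuple.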


