[jira] Updated: (HADOOP-2021) Sort Join Implementation

Edward Yoon (JIRA) Tue, 20 Nov 2007 01:07:07 -0800

     [ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Edward Yoon updated HADOOP-2021:
--------------------------------

    Description: 
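Even if no index exists for a domain used in the join, a sort join can still 
improve on the nested-loop join: once one input is sorted on the join domain, 
matching rows can be located without rescanning that input for every row.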

{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}
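
As a rough illustration of why sorting helps, the sketch below contrasts a nested-loop join with a sort-merge join in Python. It is not taken from the attached patches. The function names, the row dictionaries, and the sample values (titles, tickers, company names) are made up for illustration; only the column names studioName and corporation come from the query above.

```python
from itertools import groupby
from operator import itemgetter


def nested_loop_join(left, right, left_key, right_key):
    """Baseline: compare every left row with every right row."""
    return [(l, r) for l in left for r in right if l[left_key] == r[right_key]]


def sort_merge_join(left, right, left_key, right_key):
    """Sort both inputs on the join column, then merge them in a single pass."""
    lget, rget = itemgetter(left_key), itemgetter(right_key)
    # Group each sorted input into runs of rows that share a join value.
    lruns = [(k, list(g)) for k, g in groupby(sorted(left, key=lget), key=lget)]
    rruns = [(k, list(g)) for k, g in groupby(sorted(right, key=rget), key=rget)]
    out, i, j = [], 0, 0
    while i < len(lruns) and j < len(rruns):
        (lk, lrows), (rk, rrows) = lruns[i], rruns[j]
        if lk < rk:
            i += 1          # left value has no partner yet; advance left
        elif lk > rk:
            j += 1          # right value has no partner yet; advance right
        else:
            # Equal join values: emit every pairing of the two runs.
            out.extend((l, r) for l in lrows for r in rrows)
            i, j = i + 1, j + 1
    return out


# Hypothetical sample rows (not from the issue).
movie_log = [
    {"title": "m1", "studioName": "Acme"},
    {"title": "m2", "studioName": "Globex"},
    {"title": "m3", "studioName": "Acme"},
]
stock_company = [
    {"corporation": "Initech", "ticker": "INI"},
    {"corporation": "Acme", "ticker": "ACM"},
]

joined = sort_merge_join(movie_log, stock_company, "studioName", "corporation")
for movie, company in joined:
    print(movie["title"], company["ticker"])
```

The nested-loop version compares every pair of rows, while the sorted version pays for two sorts and then walks each input once (plus the cost of emitting the matches).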

----
{code}
r1
       a     b    c
======================
row1   a1    b1   c1
row2   a2    b2   c2
row3   a1    b3   c3

r2
       e     f
==================
row1   e1    a1
row2   e2    f2
row3   e3    f3
row4   e4    a1
row5   e5    a2

r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;

---------------------------------------------
temp table T : Sorted set by "f"

    row       e     f
===========================
a1  row:row1  e1    a1
    row:row4  e4    a1
a2  row:row5  e5    a2
f2  row:row2  e2    f2
f3  row:row3  e3    f3
---------------------------------------------

r3
           r1.row   a    b    c   r2.row    e   f  
===================================================
row1.row1  row1     a1   b1   c1  row1      e1  a1
row1.row4  row1     a1   b1   c1  row4      e4  a1
row2.row5  row2     a2   b2   c2  row5      e5  a2
row3.row1  row3     a1   b3   c3  row1      e1  a1
row3.row4  row3     a1   b3   c3  row4      e4  a1
{code}
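
The worked example above can be replayed in memory with a short Python sketch. This is an illustration under assumptions, not the patch's implementation: the tables are plain lists, temp table T is built with an in-memory sort rather than stored in HBase, and the binary-search probe is one possible way to find the matching run in T.

```python
from bisect import bisect_left, bisect_right

# Tables from the example, as (row key, columns) pairs.
r1 = [
    ("row1", {"a": "a1", "b": "b1", "c": "c1"}),
    ("row2", {"a": "a2", "b": "b2", "c": "c2"}),
    ("row3", {"a": "a1", "b": "b3", "c": "c3"}),
]
r2 = [
    ("row1", {"e": "e1", "f": "a1"}),
    ("row2", {"e": "e2", "f": "f2"}),
    ("row3", {"e": "e3", "f": "f3"}),
    ("row4", {"e": "e4", "f": "a1"}),
    ("row5", {"e": "e5", "f": "a2"}),
]

# Temp table T: r2 sorted on the join column "f".
# Python's sort is stable, so rows sharing a key keep their original order.
T = sorted(r2, key=lambda row: row[1]["f"])
t_keys = [cols["f"] for _, cols in T]

# Probe T once per r1 row; binary search finds the run of equal keys.
r3 = {}
for r1_row, r1_cols in r1:
    lo = bisect_left(t_keys, r1_cols["a"])
    hi = bisect_right(t_keys, r1_cols["a"])
    for r2_row, r2_cols in T[lo:hi]:
        r3[f"{r1_row}.{r2_row}"] = {
            "r1.row": r1_row, **r1_cols,
            "r2.row": r2_row, **r2_cols,
        }

for key, cols in r3.items():
    print(key, cols)
```

Because r1 is scanned in its original order and T is only probed, the result rows come out in the same order as the r3 table shown above.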





  was:
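Even if no index exists for a domain used in the join, a sort join can still 
improve on the nested-loop join: once one input is sorted on the join domain, 
matching rows can be located without rescanning that input for every row.

1. Build a sorted temporary file for the sort join with a MapReduce job.
2. Create a new relation table in HBase.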

{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}

----
{code}
r1
       a     b    c
======================
row1   a1    b1   c1
row2   a2    b2   c2
row3   a1    b3   c3

r2
       e     f
==================
row1   e1    a1
row2   e2    f2
row3   e3    f3
row4   e4    a1
row5   e5    a2

r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;

---------------------------------------------
temp table T : Sorted set by "f"

    row       e     f
===========================
a1  row:row1  e1    a1
    row:row4  e4    a1
a2  row:row5  e5    a2
f2  row:row2  e2    f2
f3  row:row3  e3    f3
---------------------------------------------

r3
           r1.row   a    b    c   r2.row    e   f  
===================================================
row1.row1  row1     a1   b1   c1  row1      e1  a1
row1.row4  row1     a1   b1   c1  row4      e4  a1
row2.row5  row2     a2   b2   c2  row5      e5  a2
row3.row1  row3     a1   b3   c3  row1      e1  a1
row3.row4  row3     a1   b3   c3  row4      e4  a1
{code}






> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments  
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt
>
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
