[ https://issues.apache.org/jira/browse/HADOOP-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Edward Yoon updated HADOOP-2021:
--------------------------------

Description:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.

{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}

----

{code}
r1
        a    b    c
======================
row1    a1   b1   c1
row2    a2   b2   c2
row3    a1   b3   c3

r2
        e    f
==================
row1    e1   a1
row2    e2   f2
row3    e3   f3
row4    e4   a1
row5    e5   a2

r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;

---------------------------------------------
temp table T : Sorted set by "f"

      row         e    f
===========================
a1    row:row1    e1   a1
      row:row4    e4
a2    row:row5    e5   a2
f2    row:row2    e2   f2
f3    row:row3    e3   f3
---------------------------------------------

r3
           r1.row   a    b    c    r2.row   e    f
===================================================
row1.row1  row1     a1   b1   c1   row1     e1   a1
row1.row4  row1     a1   b1   c1   row4     e4   a1
row2.row5  row2     a2   b2   c2   row5     e5   a2
row3.row1  row3     a1   b3   c3   row1     e1   a1
row3.row4  row3     a1   b3   c3   row4     e4   a1
{code}

was:
If we don't have an index for a domain in the join, we can still improve on the nested-loop join using sort join.

1. make a sorted set temp file for sort join using MR job
2. make a new Relation table on hbase

{code}
R1 = table('movieLog_table');
R2 = table('stockCompany_info');
result = R1.join(R1.studioName = R2.corporation) and R2;
{code}

----

{code}
r1
        a    b    c
======================
row1    a1   b1   c1
row2    a2   b2   c2
row3    a1   b3   c3

r2
        e    f
==================
row1    e1   a1
row2    e2   f2
row3    e3   f3
row4    e4   a1
row5    e5   a2

r1 = table('r1');
r2 = table('r2');
r3 = r1.join(r1.a = r2.f) and r2;

---------------------------------------------
temp table T : Sorted set by "f"

      row         e    f
===========================
a1    row:row1    e1   a1
      row:row4    e4
a2    row:row5    e5   a2
f2    row:row2    e2   f2
f3    row:row3    e3   f3
---------------------------------------------

r3
           r1.row   a    b    c    r2.row   e    f
===================================================
row1.row1  row1     a1   b1   c1   row1     e1   a1
row1.row4  row1     a1   b1   c1   row4     e4   a1
row2.row5  row2     a2   b2   c2   row5     e5   a2
row3.row1  row3     a1   b3   c3   row1     e1   a1
row3.row4  row3     a1   b3   c3   row4     e4   a1
{code}

> Sort Join Implementation
> ------------------------
>
>                 Key: HADOOP-2021
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2021
>             Project: Hadoop
>          Issue Type: Sub-task
>          Components: contrib/hbase
>    Affects Versions: 0.14.1
>         Environment: all environments
>            Reporter: Edward Yoon
>            Priority: Minor
>             Fix For: 0.16.0
>
>         Attachments: 2021_v01.patch, 2021_v02.txt
>
>
> If we don't have an index for a domain in the join, we can still improve on
> the nested-loop join using sort join.
> {code}
> R1 = table('movieLog_table');
> R2 = table('stockCompany_info');
> result = R1.join(R1.studioName = R2.corporation) and R2;
> {code}
> ----
> {code}
> r1
>         a    b    c
> ======================
> row1    a1   b1   c1
> row2    a2   b2   c2
> row3    a1   b3   c3
>
> r2
>         e    f
> ==================
> row1    e1   a1
> row2    e2   f2
> row3    e3   f3
> row4    e4   a1
> row5    e5   a2
>
> r1 = table('r1');
> r2 = table('r2');
> r3 = r1.join(r1.a = r2.f) and r2;
>
> ---------------------------------------------
> temp table T : Sorted set by "f"
>
>       row         e    f
> ===========================
> a1    row:row1    e1   a1
>       row:row4    e4
> a2    row:row5    e5   a2
> f2    row:row2    e2   f2
> f3    row:row3    e3   f3
> ---------------------------------------------
>
> r3
>            r1.row   a    b    c    r2.row   e    f
> ===================================================
> row1.row1  row1     a1   b1   c1   row1     e1   a1
> row1.row4  row1     a1   b1   c1   row4     e4   a1
> row2.row5  row2     a2   b2   c2   row5     e5   a2
> row3.row1  row3     a1   b3   c3   row1     e1   a1
> row3.row4  row3     a1   b3   c3   row4     e4   a1
> {code}
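For illustration only, the r1/r2 example in the description can be reproduced with a small in-memory sort join. This is a rough Python sketch, not the code in the attached patches: the function names `sorted_groups` and `sort_join` are hypothetical, the tables are plain dictionaries rather than HBase tables, and the temp table T is modelled as a sorted list of (join value, row keys) pairs.

```python
from itertools import groupby
from operator import itemgetter

# Sample relations from the description: row key -> column values.
r1 = {
    "row1": {"a": "a1", "b": "b1", "c": "c1"},
    "row2": {"a": "a2", "b": "b2", "c": "c2"},
    "row3": {"a": "a1", "b": "b3", "c": "c3"},
}
r2 = {
    "row1": {"e": "e1", "f": "a1"},
    "row2": {"e": "e2", "f": "f2"},
    "row3": {"e": "e3", "f": "f3"},
    "row4": {"e": "e4", "f": "a1"},
    "row5": {"e": "e5", "f": "a2"},
}


def sorted_groups(table, column):
    """Sort a table on `column` and group its row keys by that value.

    Stands in for the temp table T: an ordered list of
    (join value, [row keys]) pairs.
    """
    pairs = sorted((cols[column], row) for row, cols in table.items())
    return [(value, [row for _, row in group])
            for value, group in groupby(pairs, key=itemgetter(0))]


def sort_join(left, left_col, right, right_col):
    """Equi-join two tables by merging their sorted group lists."""
    lgroups = sorted_groups(left, left_col)
    rgroups = sorted_groups(right, right_col)
    result = {}
    i = j = 0
    while i < len(lgroups) and j < len(rgroups):
        lval, lrows = lgroups[i]
        rval, rrows = rgroups[j]
        if lval < rval:
            i += 1          # left value has no partner yet; advance left
        elif lval > rval:
            j += 1          # right value has no partner yet; advance right
        else:
            # Equal join values: emit every left/right pairing.
            # Assumes the two tables have no column names in common.
            for lrow in lrows:
                for rrow in rrows:
                    result[f"{lrow}.{rrow}"] = {**left[lrow], **right[rrow]}
            i += 1
            j += 1
    return result


temp_t = sorted_groups(r2, "f")      # temp table T, sorted by "f"
r3 = sort_join(r1, "a", r2, "f")     # r1.join(r1.a = r2.f) and r2
for key in sorted(r3):               # iterate in row-key order, as in r3
    print(key, r3[key])
```

Each input is sorted once and then scanned once, so the merge avoids comparing every r1 row against every r2 row as a nested-loop join would. The composite `r1row.r2row` key follows the naming used in the r3 table of the description.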