[Hadoop Wiki] Update of "Hive/JoinOptimization" by Liyi nTang

Apache Wiki Wed, 01 Dec 2010 18:19:57 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "Hive/JoinOptimization" page has been changed by LiyinTang.
http://wiki.apache.org/hadoop/Hive/JoinOptimization?action=diff&rev1=11&rev2=12

--------------------------------------------------

  '''Fig 1. The Previous Map Join Implementation'''
  
  In Fig 1 above, the previous map join implementation does not scale well when 
the larger table is huge because each Mapper will directly read the small table 
data from HDFS. If the larger table is huge, there will be thousands of Mapper 
launched to read different records of the larger table. And those thousands of 
Mappers will read this small table data from HDFS into their memory, which can 
make access to the small table become the performance bottleneck; or, sometimes 
Mappers will get lots of time-outs for reading this small file, which may cause 
the task to fail.
- 
  
  Hive-1641 
([[http://issues.apache.org/jira/browse/HIVE-1293|http://issues.apache.org/jira/browse/HIVE-1641]])
 has solved this problem, as shown in Fig2 below.
  
@@ -51, +50 @@

  
  '''Fig 5: The Join Execution Flow'''
  
- As shown in fig5, the left side shows the previous Common Join execution 
flow, which is very straightforward. On the other side, the right side is the 
new Common Join execution flow. During the compile time, the query processor 
will generate a Conditional Task, which contains a list of tasks and one of 
these tasks will be resolved to run during the execution time. It means the 
tasks in the Conditional Task's list are the candidates and one of them will be 
chosen to run during the run time. First, the original Common Join Task should 
be put into the list. Also the query processor will generate a series of Map 
Join Task by assuming each of the input tables may be the big table. Take the 
same example as before, '''''select /*+mapjoin(a)*/ * from src1 x  join src2y 
on x.key=y.key'''''. Both table '''''src2 '''''and '''''src1 '''''may be the 
big table, so it will generate 2 Map Join Task. One is assuming src1 is the big 
table and the other is assuming src2 is the big table, as shown in Fig 6.
+ As shown in fig5, the left side shows the previous Common Join execution 
flow, which is very straightforward. On the other side, the right side is the 
new Common Join execution flow. During the compile time, the query processor 
will generate a Conditional Task, which contains a list of tasks and one of 
these tasks will be resolved to run during the execution time. It means the 
tasks in the Conditional Task's list are the candidates and one of them will be 
chosen to run during the run time. First, the original Common Join Task should 
be put into the list. Also the query processor will generate a series of Map 
Join Task by assuming each of the input tables may be the big table. For 
example, '''''select  * from src1 x  join src2y on x.key=y.key'''''. Both table 
'''''src2 '''''and '''''src1 '''''may be the big table, so it will generate 2 
Map Join Task. One is assuming src1 is the big table and the other is assuming 
src2 is the big table, as shown in Fig 6.
  
  {{attachment:fig6.jpg||height="778px",width="1072px"}}

[Hadoop Wiki] Update of "Hive/JoinOptimization" by Liyi nTang

Reply via email to