Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by AshutoshChauhan:
http://wiki.apache.org/pig/PigMergeJoin

------------------------------------------------------------------------------
  inputs to a merge join.  If users wish to do three plus way joins with this 
algorithm they can decompose their joins into a series of two ways joins.
  
  ----
+ We benchmarked the performance of merge-join. Numbers are in table below. 
+ Since joins produce large number of output rows(n-squared in worst case) they 
are filtered out and not written to disk, so runtimes consist of mostly of CPU 
time.
+ Data size is approximately 1.5 GB for 1M rows.
+ 
  == Performance Benchmark ==
- ||# of rows || # of rows || Sym-Hash Join || Merge-Join||
+ ||# of rows || # of rows || Sym-Hash Join || Merge-Join||Remarks||
  ||100K||100K||        1m30.588s|| 0m41.764s||
  ||5M  ||  5M||         1m5.579s|| 0m51.535s||
  ||15M||        15M||  2m55.812s|| 2m16.906s||
@@ -118, +122 @@

  ||20M||        20M|| 10m30.819s|| 5m42.302s||
  ||20M||  20M|| 12m25.878s|| 5m42.302s||
  ||20M||        20M|| 15m56.533s|| 7m57.646s||
- ||20M || 20M|| 112m20.478s||17m16.542s||   
+ ||20M||  20M||112m20.478s||17m16.542s||Zipf Key Distribution||
- ||100M || 20M|| 30mins    ||  15mins ||
+ ||100M|| 20M|| 30mins    ||   15mins ||
- ||50M || 100M|| 42mins        ||28mins||   
+ ||50M || 100M|| 42mins         ||    28mins||
- ||50M || 100M|| 41mins, 56sec||       28mins||   
- ||50M || 100M|| 42mins||      28mins  ||   
+ ||50M || 100M|| 42mins   ||    28mins||
+ ||50M || 100M|| 42mins   ||  28mins  ||
- ||100M ||100M|| 74mins||      44mins||   
+ ||100M|| 100M|| 74mins   ||    44mins||
  
+ In our tests, we found merge-join is faster then symmetric hash join anywhere 
from 30% to 100% depending on size of data, skew in keys, number of map waves 
run etc.
+ One particular thing which affects merge join is how is input data 
distributed among files. We recommend to have all input files to be 
approximately equal in size or atleast as large as block size.
  ----
  == Phase 2 ==
  

Reply via email to