Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by SriranjanManjunath:
http://wiki.apache.org/pig/PigSkewedJoinSpec

------------------------------------------------------------------------------
     * In the first phase, the skewed join uses the order by sampling to compute 
a histogram of the records. It then relies on user configs to pass the 
intermediate keys to the right reducers.
     * In the second phase, the uniform random sampling currently used by order 
by will be replaced by a block-level sampler, which avoids the problem of 
over-sampling the data for large inputs.
  
+ [[Anchor(Performance)]]
+ == Skewed Join performance ==
+ We ran the PigMix suite L3 test on a Hadoop cluster to compare skewed join 
with the regular join. Averaged over 3 runs, skewed join took around 24 hours 
30 minutes to complete, whereas the regular join had to be killed after 
running for 5 days.
+ 
+ We conducted various performance tests to come up with a "magic" value for 
the memusage parameter. Here are the results:
+ ||Number of tuples||Number of Reducers||Total Time||Memusage||
+ ||262159 x 2607||2||8min 10sec||0.5||
+ ||262159 x 2607||3||5min 8sec||0.3||
+ ||262159 x 2607||5||3min 23sec||0.2||
+ ||262159 x 2607||9||2min 6sec||0.1||
+ ||262159 x 2607||18||1min 15sec||0.05||
+ ||262159 x 2607||36||1min 12sec||0.025||
+ ||262159 x 2607||90||1min 13sec||0.01||
+ ||262159 x 2607||112||1min 17sec||0.008||
+ ||262159 x 26195||2||77min 10sec||0.5||
+ ||262159 x 26195||3||47min 58sec||0.3||
+ ||262159 x 26195||5||27min 47sec||0.2||
+ ||262159 x 26195||9||16min 38sec||0.1||
+ ||262159 x 26195||18||8min 31sec||0.05||
+ ||262159 x 26195||36||4min 37sec||0.025||
+ ||262159 x 26195||90||3min 56sec||0.01||
+ ||262159 x 26195||112||4min 42sec||0.008||
+ 
+ As the results show, the performance of skewed join varies significantly 
with the value of memusage. We advise keeping memusage low so that the join is 
spread across multiple reducers. Note that an extremely low value increases 
the copying cost, since the streaming table then needs to be copied to more 
reducers. We have seen good performance with values in the range of 
0.1 - 0.4.
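
The trend in the table above can be checked with a few lines of arithmetic. The following is only a sketch in Python: the rows are copied from the table, while the names RUNS, to_seconds, speedups and fastest are hypothetical helpers introduced for illustration and are not part of Pig. Each row is keyed by the second number in the "Number of tuples" column.

```python
import re

# (second tuple count from the table, reducers, total time, memusage)
# Values are copied from the results table; nothing here is measured anew.
RUNS = [
    (2607, 2, "8min 10sec", 0.5), (2607, 3, "5min 8sec", 0.3),
    (2607, 5, "3min 23sec", 0.2), (2607, 9, "2min 6 sec", 0.1),
    (2607, 18, "1min 15sec", 0.05), (2607, 36, "1min 12sec", 0.025),
    (2607, 90, "1min 13sec", 0.01), (2607, 112, "1min 17sec", 0.008),
    (26195, 2, "77min 10sec", 0.5), (26195, 3, "47min 58sec", 0.3),
    (26195, 5, "27min 47sec", 0.2), (26195, 9, "16min 38sec", 0.1),
    (26195, 18, "8min 31sec", 0.05), (26195, 36, "4min 37sec", 0.025),
    (26195, 90, "3min 56sec", 0.01), (26195, 112, "4min 42sec", 0.008),
]


def to_seconds(text):
    """Convert a duration such as '8min 10sec' into seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)min\s*(\d+)\s*sec", text).groups()
    return int(minutes) * 60 + int(seconds)


def speedups(tuples):
    """Map reducer count -> speedup over the 2-reducer run for one input size."""
    rows = [(r, to_seconds(t)) for n, r, t, _ in RUNS if n == tuples]
    baseline = dict(rows)[2]
    return {r: round(baseline / secs, 2) for r, secs in rows}


def fastest(tuples):
    """Return (reducers, memusage, seconds) of the shortest run for one input size."""
    rows = [(to_seconds(t), r, m) for n, r, t, m in RUNS if n == tuples]
    secs, reducers, memusage = min(rows)
    return reducers, memusage, secs


if __name__ == "__main__":
    for size in (2607, 26195):
        print(size, fastest(size), speedups(size))
```

On these rows the running time stops improving beyond 36 reducers for the smaller input and beyond 90 reducers for the larger one, which is consistent with the note above that very low memusage values add copying cost.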
+ 
+ [[Anchor(Usage)]]
+ == Usage Notes ==
+    * Append the 'using "skewed"' construct to the join to force Pig to use 
skewed join
+    * Set pig.skewedjoin.more ~/.pig 
+ 
+ 
  [[Anchor(References)]]
  == References ==
     (1) "Practical Skew Handling in Parallel Joins" - David J. DeWitt, Jeffrey 
F. Naughton, Donovan A. Schneider, S. Seshadri
