Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by SriranjanManjunath:
http://wiki.apache.org/pig/PigSkewedJoinSpec

------------------------------------------------------------------------------
  
  C = JOIN big BY b1, massive BY m1 USING "skewed";
  }}}
+ 
+ To use skewed join:
+ 
+    * Append 'using "skewed"' to the join statement to force Pig to use skewed join.
+    * Set pig.skewedjoin.reduce.memusage to the fraction of heap available for the reducer to perform the join. A low fraction forces Pig to use more reducers but increases copying cost. For the PigMix tests, we have seen good performance with this value in the range 0.1 - 0.4. Treat this range only as a rough guide: the right value depends on the amount of heap available for the operation, the number of columns in the input, and the skew. It is best found by experiment.
+ 
+ 
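The two steps above can be put together in a short script. This is a minimal sketch, not the spec's reference usage: the input paths, the field lists, and the output path are hypothetical, and it assumes a Pig version whose SET command accepts arbitrary properties.

```pig
-- Assumption: SET accepts arbitrary properties in this Pig version.
-- '0.3' is only a starting point inside the 0.1 - 0.4 range; tune by experiment.
SET pig.skewedjoin.reduce.memusage '0.3';

-- Hypothetical input paths and schemas, for illustration only.
big     = LOAD 'big_data'     AS (b1, b2, b3);
massive = LOAD 'massive_data' AS (m1, m2, m3);

-- Force the skewed join implementation.
C = JOIN big BY b1, massive BY m1 USING "skewed";

-- Hypothetical output path.
STORE C INTO 'skewed_join_output';
```

If SET does not accept this property in a given installation, passing it as a Java property when launching Pig (for example -Dpig.skewedjoin.reduce.memusage=0.3) is the likely alternative; this too is an assumption about the deployment.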
  [[Anchor(Requirements)]]
  == Requirements ==
  
@@ -65, +72 @@

  == Skewed Join performance ==
  We ran the PigMix suite L3 test on a Hadoop cluster to compare skewed join with the regular join. Averaged over 3 runs, skewed join took around 24 hours 30 minutes to complete, whereas the regular join had to be killed after running for 5 days.
  
- We conducted various performance tests to come up with a "magic" value for 
the memusage parameter. Here are the results:
+ We conducted various performance tests to come up with a "magic" value for the memusage parameter. We ran the PigMix suite L3 query to join a 9-column input with a 2-column input. Here are the results:
  ||Number of tuples||Number of Reducers||Total Time||Memusage||
  ||262159 x 2607||2||8min 10sec||0.5||
  ||262159 x 2607||3||5min 8sec||0.3||
@@ -84, +91 @@

  ||262159 x 26195||90||3min 56sec||0.01||
  ||262159 x 26195||112||4min 42sec||0.008||
  
- As evident from the results, the performance of skewed join varies 
significantly with the value of memusage. We will advise keeping a low value 
for memusage, thus using multiple reducers for the join. Note that setting an 
extremely low value increases the copying cost since the streaming table now 
needs to be copied to more reducers. We have seen good performance when this 
value was set in the range of 0.1 - 0.4.
+ When both inputs had 2 columns, memusage had to be set even lower to bring in more than one reducer:
  
- [[Anchor(Usage)]]
- == Usage Notes ==
-    * Append 'using "skewed"' construct to the join to force pig to use skewed 
join
-    * Set pig.skewedjoin.reduce.memusage preferably in the range 0.1 - 0.4.
+ ||Number of tuples||Number of Reducers||Total Time||Memusage||
+ ||262159 x 2607||1||11min 57sec||0.5||
+ ||262159 x 2607||1||11min 57sec||0.3||
+ ||262159 x 2607||1||11min 57sec||0.2||
+ ||262159 x 2607||1||11min 57sec||0.1||
+ ||262159 x 2607||1||11min 57sec||0.05||
+ ||262159 x 2607||2||6min 22sec||0.025||
+ ||262159 x 2607||5||2min 40sec||0.01||
+ ||262159 x 2607||6||2min 19sec||0.008||
+ ||262159 x 2607||14||1min 16sec||0.003||
+ ||262159 x 2607||42||1min 8sec||0.001||
+ ||262159 x 2607||83||1min 7sec||0.0005||
+ ||262159 x 26195||1||113min 48sec||0.5||
+ ||262159 x 26195||1||113min 48sec||0.3||
+ ||262159 x 26195||1||113min 48sec||0.2||
+ ||262159 x 26195||1||113min 48sec||0.1||
+ ||262159 x 26195||1||113min 48sec||0.05||
+ ||262159 x 26195||2||60min 17sec||0.025||
+ ||262159 x 26195||5||23min 35sec||0.01||
+ ||262159 x 26195||6||20min 9sec||0.008||
+ ||262159 x 26195||14||9min 20sec||0.003||
+ ||262159 x 26195||42||5min 41sec||0.001||
+ ||262159 x 26195||83||3min 42sec||0.0005||
+ 
+ 
+ As the results show, the performance of skewed join varies significantly with the value of memusage. We advise keeping memusage low so that the join uses multiple reducers. Note that an extremely low value increases the copying cost, since the streaming table must then be copied to more reducers. For the PigMix tests, we have seen good performance with this value in the range 0.1 - 0.4.
  
  
  [[Anchor(References)]]
