Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by SriranjanManjunath:
http://wiki.apache.org/pig/PigSkewedJoinSpec

------------------------------------------------------------------------------
  C = JOIN big BY b1, massive BY m1 USING "skewed";
  }}}
+
+ In order to use skewed join,
+
+  * Append 'using "skewed"' construct to the join to force pig to use skewed join
+  * pig.skewedjoin.reduce.memusage specifies the fraction of heap available for the reducer to perform the join. A low fraction forces pig to use more reducers but increases copying cost. For pigmix tests, we have seen good performance when we set this value in the range 0.1 - 0.4. However, note that this is hardly an accurate range. Its value depends on the amount of heap available for the operation, the number of columns in the input and the skew. It is best obtained by conducting experiments to achieve a good performance.
+
+
  [[Anchor(Requirements)]]
  == Requirements ==
@@ -65, +72 @@
  == Skewed Join performance ==
  We have run the PigMix suite L3 test on a Hadoop cluster to compare skewed join with the regular join. On an average of 3 runs, skewed join took around 24 hours 30 minutes to complete whereas the regular join had to be killed after running for 5 days.
- We conducted various performance tests to come up with a "magic" value for the memusage parameter. Here are the results:
+ We conducted various performance tests to come up with a "magic" value for the memusage parameter. We ran the pigmix suite L3 query to join an input with 9 columns with an input with 2 columns. Here are the results:
  ||Number of tuples||Number of Reducers||Total Time||Memusage||
  ||262159 x 2607||2||8min 10sec||0.5||
  ||262159 x 2607||3||5min 8sec||0.3||
@@ -84, +91 @@
  ||262159 x 26195||90||3min 56sec||0.01||
  ||262159 x 26195||112||4min 42sec||0.008||
- As evident from the results, the performance of skewed join varies significantly with the value of memusage. We will advise keeping a low value for memusage, thus using multiple reducers for the join. Note that setting an extremely low value increases the copying cost since the streaming table now needs to be copied to more reducers. We have seen good performance when this value was set in the range of 0.1 - 0.4.
+ When both the inputs had 2 columns, the value of memusage had to be even lower:
- [[Anchor(Usage)]]
- == Usage Notes ==
-  * Append 'using "skewed"' construct to the join to force pig to use skewed join
-  * Set pig.skewedjoin.reduce.memusage preferably in the range 0.1 - 0.4.
+ ||Number of tuples||Number of Reducers||Total Time||Memusage||
+ ||262159 x 2607||1||11min 57sec||0.5||
+ ||262159 x 2607||1||11min 57sec||0.3||
+ ||262159 x 2607||1||11min 57sec||0.2||
+ ||262159 x 2607||1||11min 57sec||0.1||
+ ||262159 x 2607||1||11min 57sec||0.05||
+ ||262159 x 2607||2||6min 22sec||0.025||
+ ||262159 x 2607||5||2min 40sec||0.01||
+ ||262159 x 2607||6||2min 19sec||0.008||
+ ||262159 x 2607||14||1min 16sec||0.003||
+ ||262159 x 2607||42||1min 8sec||0.001||
+ ||262159 x 2607||83||1min 7sec||0.0005||
+ ||262159 x 26195||1||113min 48sec||0.5||
+ ||262159 x 26195||1||113min 48sec||0.3||
+ ||262159 x 26195||1||113min 48sec||0.2||
+ ||262159 x 26195||1||113min 48sec||0.1||
+ ||262159 x 26195||1||113min 48sec||0.05||
+ ||262159 x 26195||2||60min 17sec||0.025||
+ ||262159 x 26195||5||23min 35sec||0.01||
+ ||262159 x 26195||6||20min 9sec||0.008||
+ ||262159 x 26195||14||9min 20sec||0.003||
+ ||262159 x 26195||42||5min 41sec||0.001||
+ ||262159 x 26195||83||3min 42sec||0.0005||
+
+
+ As evident from the results, the performance of skewed join varies significantly with the value of memusage. We will advise keeping a low value for memusage, thus using multiple reducers for the join. Note that setting an extremely low value increases the copying cost since the streaming table now needs to be copied to more reducers. We have seen good performance when this value was set in the range of 0.1 - 0.4 for the pigmix tests.
  [[Anchor(References)]]