Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by SriranjanManjunath:
http://wiki.apache.org/pig/PigSkewedJoinSpec

------------------------------------------------------------------------------
  [[Anchor(Implementation)]]
  == Implementation ==
  
- Skewed join translates into two map/reduce jobs - Sample and Join. The first 
job samples the input records and computes a histogram of the underlying key 
space. The second map/reduce job partitions the input table and performs a join 
on the predicate. In order to join the two tables, one of the tables is 
partitioned and other is streamed to the reducer. The map task of this job uses 
the ~-pig.quantiles-~ file to determine the number of reducers per key. It then 
sends the key to each of the reducers in a round robin fashion. Skewed joins 
happen in the reduce phase. 
+ Skewed join translates into two map/reduce jobs - Sample and Join. The first 
job samples the input records and computes a histogram of the underlying key 
space. The second map/reduce job partitions the input table and performs the 
join on the predicate. To join the two tables, one of the tables is partitioned 
and the other is streamed to the reducers. The map task of the join job uses 
the ~-pig.keydist-~ file to determine the number of reducers per key. It then 
sends the records for each skewed key to its reducers in a round-robin fashion. 
The skewed join itself happens in the reduce phase of the join job.
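As a rough sketch (in Python, with hypothetical names and reducer counts; the actual routing lives inside Pig's map/reduce code), the map task of the join job might route records like this:

```python
from itertools import count

# Hypothetical key distribution as loaded from the pig.keydist file:
# each skewed key maps to the list of reducers assigned to it.
keydist = {"hot_key": [0, 1, 2]}   # a skewed key spread over 3 reducers
total_reducers = 4

# One round-robin counter per skewed key.
counters = {k: count() for k in keydist}

def route_partitioned(key):
    """Route a record of the partitioned table: records for a skewed key
    cycle over its reducers round-robin; other keys are hashed."""
    if key in keydist:
        reducers = keydist[key]
        return reducers[next(counters[key]) % len(reducers)]
    return hash(key) % total_reducers

def route_streamed(key):
    """Route a record of the streamed table: it must be copied to every
    reducer that holds a partition of the skewed key."""
    return keydist.get(key, [hash(key) % total_reducers])

# Records carrying the hot key cycle over reducers 0, 1, 2, 0, 1, 2, ...
assignments = [route_partitioned("hot_key") for _ in range(6)]
```

A record of the streamed table with a skewed key is duplicated to all of that key's reduce partitions, which is what makes the round-robin split of the partitioned table safe.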
  
  attachment:partition.jpg
  
  [[Anchor(Sampler_phase)]]
  === Sampler phase ===
- If the underlying data is sufficiently skewed, load imbalances will result in 
a few reducers getting a lot of keys. As a first task, the sampler creates a 
histogram of the key distribution and stores it in the ~-pig.keydist-~ file. In 
order to reduce spillage, the sampler conservatively estimates the number of 
rows that can be sent to a single reducer based on the memory available for the 
reducer. Using this information it creates a map of the reducers and the skewed 
keys. For the table which is partitioned, the partitioner uses the key 
distribution to send the data to the reducer in a round robin fashion. For the 
table which is streamed, the mapper task uses the ~-pig.keydist-~ file to copy 
the data to each of the reduce partitions. 
+ If the underlying data is sufficiently skewed, load imbalances will result in 
a few reducers receiving a disproportionate share of the records. As its first 
task, the sampler creates a histogram of the key distribution and stores it in 
the ~-pig.keydist-~ file. To reduce spillage, the sampler conservatively 
estimates the number of rows that can be sent to a single reducer, based on the 
memory available to the reducer. The memory available to the reducer is the 
product of the heap size and the memusage parameter specified by the user. 
Using this information, the sampler creates a map of the reducers and the 
skewed keys. For the table which is partitioned, the partitioner uses the key 
distribution to send the data to the reducers in a round-robin fashion. For the 
table which is streamed, the mapper task uses the ~-pig.keydist-~ file to copy 
the data to each of the reduce partitions. 
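The memory estimate above can be illustrated with a small sketch (hypothetical figures; the heap size and record size are assumptions, not values from the spec):

```python
# Memory available to a reducer = JVM heap size * user-supplied memusage
# fraction (pig.skewedjoin.reduce.memusage).
heap_size_bytes = 512 * 1024 * 1024   # e.g. a 512 MB reducer heap
memusage = 0.25                        # user-specified fraction

available_bytes = heap_size_bytes * memusage

# Average record size, estimated from the sample.
avg_record_size_bytes = 256

# Conservative cap on the rows one reducer can hold without spilling.
rows_per_reducer = int(available_bytes // avg_record_size_bytes)
```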
  
- As a first stab at the implementation, we will be using the uniform random 
sampler used by Order BY. The sampler currently does not output the key 
distribution nor the size of the sample record. It will be modified to support 
the same.
+ As a first stab at the implementation, we will use the random sampler used by 
ORDER BY. The sampler currently outputs neither the key distribution nor the 
size of the sample records; it will be modified to do so.
  
  [[Anchor(Join_phase)]]
  === Join Phase ===
@@ -48, +48 @@

  
  [[Anchor(number_of_reducers)]]
  == Determining the number of reducers per key ==
- The number of reducers for a key is obtained from the key distribution file. 
Along with the distribution, the sampler estimates the number of reducers 
needed for a key by calculating the number of records that fit in a reducer. It 
computes this by estimating the size of the sample and the amount of heap 
available to the jvm for the join operation. The amount of heap is given as a 
config parameter pig.mapred.skewedjoin.heapsize by the user. Knowing the number 
of records per reducer helps minimize disk spillage.
+ The number of reducers for a key is obtained from the key distribution file. 
Along with the distribution, the sampler estimates the number of reducers 
needed for a key by calculating the number of records that fit in a single 
reducer. It computes this from the estimated size of a sample record and the 
fraction of the heap available to the JVM for the join operation. The fraction 
of the heap is provided by the user as the config parameter 
pig.mapred.skewedjoin.memusage. Knowing the number of records per reducer helps 
minimize disk spillage.
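The resulting per-key reducer count follows directly from the two estimates; a minimal sketch with hypothetical figures:

```python
import math

rows_per_reducer = 524_288   # from the sampler's memory estimate
rows_for_key = 2_000_000     # estimated number of records carrying this key

# A key whose records do not fit in one reducer is spread over several.
reducers_for_key = max(1, math.ceil(rows_for_key / rows_per_reducer))
```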
  
  [[Anchor(3_way_join)]]
  == Handling 3-way joins ==
@@ -89, +89 @@

  [[Anchor(Usage)]]
  == Usage Notes ==
     * Append the 'using "skewed"' construct to the join to force Pig to use a 
skewed join.
-    * Set pig.skewedjoin.reduce.memusage preferably in the range 0.1 - 0.4
+    * Set pig.skewedjoin.reduce.memusage, preferably in the range 0.1 to 0.4.
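Putting both notes together, a small Pig Latin script might look as follows (the file and relation names are hypothetical):

```pig
set pig.skewedjoin.reduce.memusage 0.3;

A = LOAD 'big_table' AS (key, val);
B = LOAD 'skewed_table' AS (key, val2);

-- Force the skewed join implementation.
J = JOIN A BY key, B BY key using "skewed";

STORE J INTO 'join_output';
```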
  
  
  [[Anchor(References)]]
