[Pig Wiki] Update of "PigSkewedJoinSpec" by SriranjanManjunath

Apache Wiki Thu, 07 May 2009 13:01:45 -0700

The following page has been changed by SriranjanManjunath:
http://wiki.apache.org/pig/PigSkewedJoinSpec

------------------------------------------------------------------------------
  [[Anchor(Intro)]]
  == Introduction ==
  
- Parallel joins are vulnerable to the presence of skew in the underlying data. 
If the underlying data is sufficiently skewed, load imbalances will swamp any 
of the parallelism gains [#References (1)]. In order to counteract this 
problem, skewed join computes a histogram of the key space and uses this data 
to allocate reducers for a given key. Skewed join does not place a restriction 
on the size of the input tables. It accomplishes this by splitting one of the 
input table on the join predicate and streaming the other table.
+ Parallel joins are vulnerable to the presence of skew in the underlying data. 
If the underlying data is sufficiently skewed, load imbalances will swamp any 
of the parallelism gains [#References (1)]. To counteract this 
problem, skewed join computes a histogram of the key space and uses this data 
to allocate reducers for a given key. Skewed join does not place a restriction 
on the size of the input keys. It accomplishes this by splitting one of the 
inputs on the join predicate and streaming the other input.
  [[Anchor(Use_cases)]]
  == Use cases ==
  
@@ -32, +32 @@

  
  [[Anchor(Sampler_phase)]]
  === Sampler phase ===
- If the underlying data is sufficiently skewed, load imbalances will result in 
a few reducers getting a lot of keys. As a first task, the sampler creates a 
histogram of the key distribution and stores it in the ~-pig.keydist-~ file. 
This key distribution will be used to allocate the right number of reducers for 
a key. For the table which is partitioned, the partitioner uses the key 
distribution to copy the output to the reducer buffer regions in a round robin 
fashion. For the table which is streamed, the mapper task uses the 
~-pig.keydist-~ file to copy the data to each of the reduce partitions. 
+ If the underlying data is sufficiently skewed, load imbalances will result in 
a few reducers getting a lot of keys. As a first task, the sampler creates a 
histogram of the key distribution and stores it in the ~-pig.keydist-~ file. 
This key distribution will be used to allocate the right number of reducers for 
a key. For the table which is partitioned, the partitioner uses the key 
distribution to send the data to the reducers in a round-robin fashion. For the 
table which is streamed, the mapper task uses the ~-pig.keydist-~ file to copy 
the data to each of the reduce partitions. 
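
The sampler step can be illustrated with a small sketch. This is not Pig's implementation: the function name `build_key_distribution`, the `records_per_reducer` capacity parameter, and the rule "one reducer per `records_per_reducer` sampled records" are assumptions made for illustration only. The spec only states that a histogram is built and used to allocate reducers per key.

```python
import math
from collections import Counter


def build_key_distribution(sampled_keys, records_per_reducer):
    """Map each sampled join key to an estimated number of reducers.

    Hypothetical allocation rule (an assumption, not from the spec):
    a key gets ceil(count / records_per_reducer) reducers, at least one.
    """
    if records_per_reducer <= 0:
        raise ValueError("records_per_reducer must be positive")
    histogram = Counter(sampled_keys)  # key -> number of sampled records
    return {
        key: max(1, math.ceil(count / records_per_reducer))
        for key, count in histogram.items()
    }


# Example: key "a" dominates the sample, so it is given several reducers.
sample = ["a"] * 10 + ["b"] * 2 + ["c"]
distribution = build_key_distribution(sample, records_per_reducer=4)
```

In this sketch the returned mapping plays the role of the contents of the ~-pig.keydist-~ file.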
  
- As a first stab at the implementation, we will be using the uniform random 
sampler used by Order BY. The sampler currently does not output the key 
distribution. It will be modified to support the same.
+ As a first stab at the implementation, we will be using the uniform random 
sampler used by ORDER BY. The sampler currently outputs neither the key 
distribution nor the size of the sample record. It will be modified to output 
both.
  [[Anchor(Sort_phase)]]
  === Sort phase ===
  The keys are sorted based on the input predicate.
  [[Anchor(Join_phase)]]
  === Join Phase ===
- Skewed join happens in the reduce phase. As a convention, the first table in 
the join command is partitioned and sent to the various reducers. Partitioning 
allows us to support massive tables without having to worry about the memory 
limitations. The partitioner is overridden to send the data in a round robin 
fashion to each of the reducers associated with a key. The partitioner obtains 
the reducer information from the key distribution file. To counteract the 
issues with reducer starvation (i.e. the keys that require more than 1 reducer 
are granted the reducers whereas the other keys are starved for the reducers), 
the user is allowed to set a config parameter 
pig.mapreduce.skewedjoin.uniqreducers. The value is a percentage of unique 
reducers the partitioner should use. For ex: if the value is 90, 10% of the 
total reducers will be used for highly skewed data.
+ Skewed join happens in the reduce phase. As a convention, the first table in 
the join command is partitioned and sent to the various reducers. Partitioning 
allows us to support massive tables without having to worry about memory 
limitations. The partitioner is overridden to send the data in a round-robin 
fashion to each of the reducers associated with a key. The partitioner obtains 
the reducer information from the key distribution file. To counteract 
reducer starvation (i.e. keys that require more than one reducer 
are granted reducers while the other keys are starved of them), 
the user can set the config parameter 
pig.mapreduce.skewedjoin.uniqreducers. The value is the percentage of unique 
reducers the partitioner should use. For example, if the value is 90, the 
remaining 10% of the total reducers will be used for highly skewed data. If the 
input is highly skewed and the number of reducers is very low, the task will 
bail out and report an error.
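
One possible reading of the round-robin partitioning described above, as a sketch only. The class `SkewedJoinPartitioner`, its constructor arguments, the contiguous split of reducer indices into a "unique" range and a "skew" pool, and the CRC-based hashing of ordinary keys are all assumptions; the spec does not say how reducers are numbered or assigned.

```python
import zlib


class SkewedJoinPartitioner:
    """Hypothetical partitioner sketch; not Pig's actual class."""

    def __init__(self, reducers_needed, total_reducers, uniq_percent=90):
        # Assumption: reducers [0, unique_count) serve ordinary keys and
        # the remaining reducers form the pool for skewed keys.
        self.unique_count = total_reducers * uniq_percent // 100
        if self.unique_count < 1:
            raise RuntimeError("no reducers left for non-skewed keys")
        skew_pool = list(range(self.unique_count, total_reducers))
        self.assigned = {}
        start = 0
        for key, needed in reducers_needed.items():
            if needed <= 1:
                continue  # not skewed; handled by hashing in partition()
            if start + needed > len(skew_pool):
                # Mirrors the "bail out and report an error" behaviour.
                raise RuntimeError("too few reducers for the observed skew")
            self.assigned[key] = skew_pool[start:start + needed]
            start += needed
        self._next = dict.fromkeys(self.assigned, 0)

    def partition(self, key):
        """Return the reducer index for one record of the partitioned table."""
        reducers = self.assigned.get(key)
        if reducers is None:
            digest = zlib.crc32(repr(key).encode("utf-8"))
            return digest % self.unique_count
        slot = self._next[key]
        self._next[key] = (slot + 1) % len(reducers)  # round robin
        return reducers[slot]


partitioner = SkewedJoinPartitioner({"a": 2, "b": 1}, total_reducers=10,
                                    uniq_percent=80)
targets = [partitioner.partition("a") for _ in range(3)]
```

With these inputs, records for the skewed key "a" alternate between the two reducers in the skew pool, while "b" always hashes to a single reducer in the unique range.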
  
  For the streaming table, since more than one reducer can be associated with a 
key, the streamed table records (that match the key) need to be copied over to 
each of these reducers. The mapper function uses the key distribution in the 
~-pig.keydist-~ file to copy the records over to each of the partitions. It 
accomplishes this by inserting a [#PRop PRop] into the logical plan. The [#PRop 
PRop] sets a partition index on each key/value pair, which is then used 
by the partitioner to send the pair to the right reducer.
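
The copy step for the streamed table can be sketched as follows. The function `replicate_streamed_record`, the shape of `key_reducers` (key mapped to a list of reducer indices), and the `default_partition` callback are hypothetical names used for illustration; they are not part of the PRop described above.

```python
def replicate_streamed_record(key, value, key_reducers, default_partition):
    """Yield (partition_index, key, value) once per reducer holding `key`.

    key_reducers: assumed mapping of skewed key -> list of reducer indices.
    default_partition: assumed fallback for keys served by one reducer.
    """
    reducers = key_reducers.get(key)
    if not reducers:
        # Non-skewed key: a single copy goes to its one reducer.
        yield (default_partition(key), key, value)
        return
    for index in reducers:
        # Skewed key: every reducer holding a slice of the partitioned
        # table needs its own copy of the streamed record.
        yield (index, key, value)


skewed = list(replicate_streamed_record("a", 1, {"a": [8, 9]}, lambda k: 0))
plain = list(replicate_streamed_record("b", 2, {"a": [8, 9]}, lambda k: 0))
```

The partition index attached to each tuple is what a partitioner would read to route the pair, so the streamed record meets every slice of the partitioned key.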
