Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by SriranjanManjunath:
http://wiki.apache.org/pig/PigSampler

------------------------------------------------------------------------------
  
  The main purpose of the skewed join sampler is to produce a reducer 
allocation map for the skewed keys. A custom slicer is used to estimate the 
number of maps required to run the sampler job. Although the number of 
partitions gives us a base number of samples for the input, with uniformly 
distributed random samples we may undersample the data. Hence, we use the 
Poisson cumulative distribution function to estimate the total number of 
samples required to represent the underlying data. The math behind the 
distribution function is attached.
  
- For a 1 TB file running on nodes that have 512 MB of memory, assuming a 
conversion factor of 2, the number of base samples turns out to be 4000.
+ For a 1 TB file running on nodes that have 512 MB of memory, assuming a 
conversion factor of 2, the number of base samples turns out to be 4000.
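
The page does not spell out how the figure of 4000 is derived. A minimal sketch of one formula that gives roughly that number is shown here, assuming one base sample per memory-sized chunk of input after dividing the memory by the conversion factor. The function name `base_sample_count` and the formula itself are assumptions, not taken from the page.

```python
def base_sample_count(file_bytes: int, memory_bytes: int, conversion_factor: int) -> int:
    """Assumed formula: one base sample per chunk of input that fits in memory.

    The conversion factor is read here as the blow-up from on-disk bytes to
    in-memory bytes, so only memory_bytes / conversion_factor of input fits.
    """
    usable_bytes = memory_bytes // conversion_factor
    # Ceiling division so a trailing partial chunk still gets a sample.
    return -(-file_bytes // usable_bytes)


TB = 1024 ** 4
MB = 1024 ** 2

# 1 TB input, 512 MB of memory per node, conversion factor 2.
print(base_sample_count(1 * TB, 512 * MB, 2))  # 4096, which the page rounds to 4000
```

With decimal units (10^12 and 512 * 10^6 bytes) the same formula gives 3907, so either convention lands close to the 4000 quoted above.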
  
  === Estimating the number of samples ===
- The probability that a partition has at most k samples is predicted by the 
Poisson cumulative distribution function. Although the value of k needs to be 
determined experimentally, a guidance value of 10 is suggested by various 
sources. A table of cumulative probabilities for a selected range of the sample 
rate (lambda) and the number of samples per partition is attached. From the 
table, for 95% confidence and k (number of samples) set to 10, the sampling 
rate appears to be 17.
+ The probability that a partition has at most k samples is predicted by the 
Poisson cumulative distribution function. Although the value of k needs to be 
determined experimentally, a guidance value of 10 is suggested by various 
sources. A table of cumulative probabilities for a selected range of the sample 
rate (lambda) and the number of samples per partition is attached. From the 
table, for 95% confidence and k (number of samples) set to 10, the sampling 
rate appears to be 17. With 4000 base samples, the number of samples that we 
need to obtain from the input is 17 × 4000 = 68000.
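
The attached table is not reproduced in this notification, but the quoted numbers can be checked directly. The sketch below assumes that "95% confidence" means the probability of a partition receiving at most k samples should be no more than 5%, and that only integer sampling rates are considered. The helper names `poisson_cdf` and `min_sampling_rate` are illustrative, not part of Pig.

```python
import math


def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * sum(lam ** i / math.factorial(i) for i in range(k + 1))


def min_sampling_rate(k: int, confidence: float) -> int:
    """Smallest integer lambda with P(X <= k) <= 1 - confidence.

    The CDF at a fixed k decreases as lambda grows, so a linear scan suffices.
    """
    lam = 1
    while poisson_cdf(k, lam) > 1.0 - confidence:
        lam += 1
    return lam


rate = min_sampling_rate(k=10, confidence=0.95)
total_samples = rate * 4000  # 4000 base samples, as stated on the page

print(rate, total_samples)  # 17 68000
```

Under this reading, lambda = 16 still leaves about a 7.7% chance of 10 or fewer samples in a partition, while lambda = 17 brings it down to about 4.9%, which matches the rate of 17 read off the table.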
  
  == Implementation ==
   * An abstract sampling class will define functions for getSamplingRate and 
skipinterval
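
As a rough illustration of the bullet above, the sketch below shows what such an abstract class might look like. It is written in Python for brevity, whereas Pig itself is Java. Only the names getSamplingRate and skipinterval come from the page; the class names, the `sample` helper, and the reading of the skip interval as a stride between sampled tuples are all assumptions.

```python
from abc import ABC, abstractmethod
from itertools import islice
from typing import Iterable, Iterator


class Sampler(ABC):
    """Hypothetical base class; only the two method names come from the page."""

    @abstractmethod
    def get_sampling_rate(self) -> int:
        """Number of samples to draw from each partition."""

    @abstractmethod
    def skip_interval(self) -> int:
        """Stride between sampled tuples (assumed meaning of 'skipinterval')."""

    def sample(self, tuples: Iterable) -> Iterator:
        # Take every skip_interval()-th tuple, starting with the first.
        return islice(tuples, 0, None, self.skip_interval())


class FixedRateSampler(Sampler):
    """Hypothetical concrete sampler that spreads samples evenly over a partition."""

    def __init__(self, tuples_per_partition: int, samples_per_partition: int):
        self.tuples_per_partition = tuples_per_partition
        self.samples_per_partition = samples_per_partition

    def get_sampling_rate(self) -> int:
        return self.samples_per_partition

    def skip_interval(self) -> int:
        # Never return 0, which islice would reject as a step.
        return max(1, self.tuples_per_partition // self.samples_per_partition)


sampler = FixedRateSampler(tuples_per_partition=1700, samples_per_partition=17)
picked = list(sampler.sample(range(1700)))
print(sampler.skip_interval(), len(picked))  # 100 17
```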
