[Pig Wiki] Update of "PigSampler" by SriranjanManjunath

Apache Wiki Wed, 02 Sep 2009 15:02:44 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.


The following page has been changed by SriranjanManjunath:
http://wiki.apache.org/pig/PigSampler

------------------------------------------------------------------------------
  Since the frequency distribution of keys in the input is highly skewed, the 
underlying data can be modeled using a Poisson distribution. The skewed join 
sampler tries to identify the keys that are too big to fit in memory and 
allocates reducers to those skewed keys. Given an input file of size N, we need 
to estimate the number of samples needed to represent this input.
  
  The main purpose of a skewed join sampler is to come up with a reducer 
allocation map of the skewed keys. A custom slicer is used to estimate the 
number of maps that are required to run the sampler job. Although the number 
of partitions gives us a base number of samples for the input, uniformly 
distributed random samples may undersample the data. Hence, we use a Poisson 
cumulative distribution function to estimate the total number of samples 
required to represent the underlying data.
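
The page does not spell out how the Poisson cumulative distribution function is turned into a sample count, so the following is only a minimal sketch of the building block itself. The names poisson_cdf and smallest_count_with_confidence, the rate lam, and the 0.95 confidence level are all assumptions made for illustration, not taken from the Pig sampler.

```python
import math


def poisson_cdf(k, lam):
    """Return P(X <= k) for X ~ Poisson(lam), summing the terms one by one."""
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i  # P(X = i) derived from P(X = i - 1)
        total += term
    return total


def smallest_count_with_confidence(lam, confidence=0.95):
    """Hypothetical helper: smallest k such that P(X <= k) >= confidence."""
    k = 0
    while poisson_cdf(k, lam) < confidence:
        k += 1
    return k
```

A sampler could use a threshold of this kind to decide how far above the base number of samples it needs to go, but the exact rule used by Pig is not stated here.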
- 
- For a 1TB file running on nodes which have 512 MB of memory, assuming a 
conversion factor of 2, the number of base samples turns out to be 4000. 
  
  To compute the base number of samples, we calculate the number of partitions 
of the file based on the amount of memory available and file size. For a 1TB 
file running on nodes having 512 MB of memory each:
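
As a rough sketch of this arithmetic: the function name base_sample_count is hypothetical, and the formula (file size divided by memory scaled down by a conversion factor) is an assumption inferred from the figures quoted in the earlier revision of this page, which used a conversion factor of 2. Binary units (1TB = 2^40 bytes, 512 MB = 2^29 bytes) are also assumed.

```python
import math


def base_sample_count(file_size_bytes, memory_bytes, conversion_factor):
    """Assumed formula: one base sample per partition, where a partition is
    the largest slice of the file that fits in memory after the conversion
    factor is applied."""
    usable_bytes = memory_bytes / conversion_factor
    return math.ceil(file_size_bytes / usable_bytes)


ONE_TB = 2 ** 40   # assumed binary terabyte
MEM_512_MB = 2 ** 29  # assumed binary 512 MB

partitions = base_sample_count(ONE_TB, MEM_512_MB, 2)
print(partitions)
```

Under these assumptions the result is 4096 partitions, which is consistent with the figure of roughly 4000 base samples mentioned in the earlier revision.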
