Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by SriranjanManjunath:
http://wiki.apache.org/pig/PigSampler

New page:
Currently, the sampler used by pig has a few limitations:
 * It samples one in 100 records per block, which could result in under/over 
sampling of the data.
 * For extremely large inputs, 100 records per block results in over sampling 
of the data. These samples can then burden the single reducer which aggregates 
these samples.
 * Different operations need a different sampler. We thus need a generic 
sampling interface.

== Proposed changes ==
 * Addition of a "Sampler" interface that sample loaders must implement. The 
existing RandomSampleLoader will be modified to implement the same.
 * Order By will continue to use the existing RandomSampleLoader where as 
SkewedJoin will define a new Sampler. The distinction is important since the 
sample rate is different between the two and the sample rate for skewed join 
will not be known during the compilation phase.
 * Skewed Join sampler will estimate the number of samples based on the size of 
the input.
 * Using a more uniform distribution for the skewed join sample loader instead 
of making it random. The distribution can be generated offline and stored in a 
file and later used by the sample loader to pick the samples.

== Order By sampler ==
The existing order by samples the input data to get a sense of the distribution 
of ordering keys. Data is sampled 1 per 100 records by using 
RandomSampleLoader. This loader is a subclass of BinStorage and is able to skip 
all the records between samples. To calculate quantiles, the samples are 
divided into several partitions that is equal to the number of quantiles. The 
last record of each partition is retrieved to form a quantile list. It then 
counts the occurrence of the keys that fall in the quantile list for each 
partition. The probability that a key falls into a partition is then calculated 
and written to the quantile file along with the quantile list.

== Skewed Join sampler ==
Since the frequency distribution of keys in the input is highly skewed, the 
underlying data can be modeled using a poisson distribution. The skewed join 
sampler tries to identify the keys that are too big to fit in memory and 
allocates reducers to those skewed keys. Given an input file of size N, we need 
to estimate the number of samples which represents this input.

The main purpose of a skewed join sampler is to come up with a reducer 
allocation map of the skewed keys. A custom slicer is used to estimate the 
number of maps that are required to run the sampler job. Although, the number 
of partitions provides us a base number of samples for the input, for an 
uniformly distributed random samples, we may be under sampling the data. Hence, 
we use a Poisson cumulative distribution function to estimate the total number 
of samples that are required to represent the underlying data. The math behind 
the distribution function is attached.

For an 1TB file running on nodes which have 512 MB of memory, assuming a 
conversion factor of 2, the number of base samples turn out to be 4000.

=== Estimating the number of samples ===
The probability that a partition has less than or equal to k samples is 
predicted by the Poisson cumulative distribution function. Although, the value 
of k needs to be experimented, a guidance value of 10 is obtained from various 
sources. A table of cumulative probabilities for a selected range of the sample 
rate (lambda) and the number of samples per partition is attached. From the 
table, for a 95% confidence and k (number of samples) set to 10, the sampling 
rate appears to be 17.

== Implementation ==
 * An abstract sampling class will define functions for getSamplingRate and 
skipinterval
 * The existing RandomSampleLoader will extend this sampling class. It will set 
the sampling rate to 100 which is similar to the existing implementation for 
order by.
 * Skewed join's sampler will extend the sampler class and set the sampling 
rate as described above.
 * The skip interval for the skewed join sampler will be uniform. In the 
future, this can be replaced by a uniformly distributed random interval.

Reply via email to