# [Pig Wiki] Update of "PigSampler" by SriranjanManjunath

```Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change
```
The following page has been changed by SriranjanManjunath:
http://wiki.apache.org/pig/PigSampler

------------------------------------------------------------------------------
== Skewed Join sampler ==
Since the frequency distribution of keys in the input is highly skewed, the
underlying data can be modeled using a poisson distribution. The skewed join
sampler tries to identify the keys that are too big to fit in memory and
allocates reducers to those skewed keys. Given an input file of size N, we need
to estimate the number of samples which represents this input.

- The main purpose of a skewed join sampler is to come up with a reducer
allocation map of the skewed keys. A custom slicer is used to estimate the
number of maps that are required to run the sampler job. Although, the number
of partitions provides us a base number of samples for the input, for an
uniformly distributed random samples, we may be under sampling the data. Hence,
we use a Poisson cumulative distribution function to estimate the total number
of samples that are required to represent the underlying data. The math behind
the distribution function is attached.
+ The main purpose of a skewed join sampler is to come up with a reducer
allocation map of the skewed keys. A custom slicer is used to estimate the
number of maps that are required to run the sampler job. Although, the number
of partitions provides us a base number of samples for the input, for an
uniformly distributed random samples, we may be under sampling the data. Hence,
we use a Poisson cumulative distribution function to estimate the total number
of samples that are required to represent the underlying data.

For an 1TB file running on nodes which have 512 MB of memory, assuming a
conversion factor of 2, the number of base samples turn out to be 4000.

=== Estimating the number of samples ===
- The probability that a partition has less than or equal to k samples is
predicted by the Poisson cumulative distribution function. Although, the value
of k needs to be experimented, a guidance value of 10 is obtained from various
sources. A table of cumulative probabilities for a selected range of the sample
rate (lambda) and the number of samples per partition is attached. From the
table, for a 95% confidence and k (number of samples) set to 10, the sampling
rate appears to be 17. Using these numbers, the number of samples that we need
to obtain from the input is 68000.
+ The probability that a partition has less than or equal to k samples is
predicted by the Poisson cumulative distribution function. Although, the value
of k needs to be experimented, a guidance value of 10 is obtained from various
sources. A table of cumulative probabilities for a selected range of the sample
rate (lambda) and the number of samples per partition is available
[http://www.micquality.com/reference_tables/poisson.htm here]. From the table,
for a 95% confidence and k (number of samples) set to 10, the sampling rate
appears to be 17. Using these numbers, the number of samples that we need to
obtain from the input is 68000.

== Implementation ==
* An abstract sampling class will define functions for getSamplingRate and
skipinterval
@@ -30, +30 @@

* Skewed join's sampler will extend the sampler class and set the sampling
rate as described above.
* The skip interval for the skewed join sampler will be uniform. In the
future, this can be replaced by a uniformly distributed random interval.

+ == References ==
+  * Poisson Distribution, Wikipedia
+  * Rhodes, Lee "Sampling for Keys too large to fit in memory"
+
```