Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by SriranjanManjunath:
http://wiki.apache.org/pig/PigSampler

------------------------------------------------------------------------------
  The main purpose of a skewed join sampler is to come up with a reducer allocation map of the skewed keys. A custom slicer is used to estimate the number of maps that are required to run the sampler job. Although the number of partitions gives us a base number of samples for the input, with uniformly distributed random samples we may be undersampling the data. Hence, we use a Poisson cumulative distribution function to estimate the total number of samples that are required to represent the underlying data. The math behind the distribution function is attached.

- For a 1 TB file running on nodes which have 512 MB of memory, assuming a conversion factor of 2, the number of base samples turns out to be 4000.
+ For a 1 TB file running on nodes which have 512 MB of memory, assuming a conversion factor of 2, the number of base samples turns out to be 4000.

  === Estimating the number of samples ===
- The probability that a partition has at most k samples is given by the Poisson cumulative distribution function. Although the value of k needs to be determined experimentally, a guidance value of 10 is obtained from various sources. A table of cumulative probabilities for a selected range of the sample rate (lambda) and the number of samples per partition is attached. From the table, for a 95% confidence and k (number of samples) set to 10, the sampling rate appears to be 17.
+ The probability that a partition has at most k samples is given by the Poisson cumulative distribution function. Although the value of k needs to be determined experimentally, a guidance value of 10 is obtained from various sources. A table of cumulative probabilities for a selected range of the sample rate (lambda) and the number of samples per partition is attached.
+ From the table, for a 95% confidence and k (number of samples) set to 10, the sampling rate appears to be 17. Using these numbers, the number of samples that we need to obtain from the input is 4000 x 17 = 68000.

  == Implementation ==
   * An abstract sampling class will define functions for getSamplingRate and skipinterval
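
The estimate under "Estimating the number of samples" can be checked numerically. The following is only a minimal sketch, not the Pig implementation: the helper names `poisson_cdf` and `min_sampling_rate` are hypothetical, and it assumes that "95% confidence" means choosing the smallest integer lambda for which the probability of a partition receiving k or fewer samples is at most 5%. The base sample count of 4000 is taken directly from the text above.

```python
import math


def poisson_cdf(k: int, lam: float) -> float:
    """Return P(X <= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= lam / i  # P(X = i) derived from P(X = i - 1)
        total += term
    return total


def min_sampling_rate(k: int, confidence: float, max_rate: int = 1000) -> int:
    """Smallest integer rate with P(X <= k) <= 1 - confidence (assumed reading)."""
    for lam in range(1, max_rate + 1):
        if poisson_cdf(k, lam) <= 1.0 - confidence:
            return lam
    raise ValueError("no sampling rate found up to max_rate")


base_samples = 4000  # base number of samples quoted in the text
rate = min_sampling_rate(k=10, confidence=0.95)
total_samples = base_samples * rate
print(rate, total_samples)
```

Under these assumptions the search returns a rate of 17 (the cumulative probability for k = 10 is about 0.077 at lambda = 16 and about 0.049 at lambda = 17), which gives 4000 x 17 = 68000 samples, matching the figures quoted above.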
