Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The following page has been changed by SriranjanManjunath:
http://wiki.apache.org/pig/PigSampler

------------------------------------------------------------------------------
For a 1 TB file running on nodes which have 512 MB of memory, assuming a conversion factor of 2, the number of base samples turns out to be 4000.

+ To compute the base number of samples, we calculate the number of partitions of the file from the file size and the amount of memory available. For a 1 TB file running on nodes that each have 512 MB of memory:
+
+ Threshold partition ''p'',,t,, = 5.0E8 / (2 * 1.0E12) = 0.00025 (a conversion factor of 2 is assumed because Java strings are UTF-16)
+
+ Base number of samples = 1 / ''p'',,t,, = 4000
+
+
+ Assuming we should sample at least 1 record per partition, we end up with a base number of 4000 samples.
+
=== Estimating the number of samples ===
The probability that a partition has less than or equal to k samples is given by the Poisson cumulative distribution function. Although the value of k needs to be determined experimentally, a guidance value of 10 is suggested by various sources. A table of cumulative probabilities for a selected range of the sample rate (lambda) and the number of samples per partition is available [http://www.micquality.com/reference_tables/poisson.htm here]. From the table, with k (number of samples) set to 10, the cumulative probability drops to about 0.05 at a sampling rate of 17, so a rate of 17 gives 95% confidence that a partition receives more than 10 samples. Using these numbers, the number of samples that we need to obtain from the input is 17 * 4000 = 68000.
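The arithmetic above can be reproduced with a short script. This is only a minimal sketch in Python, not the code used by Pig: the function names (`base_sample_count`, `poisson_cdf`, `min_sampling_rate`) are made up for illustration, 512 MB is rounded to 5.0E8 bytes as in the text, and the search is restricted to integer sampling rates to match the granularity of the table lookup.

```python
import math


def base_sample_count(file_size_bytes, memory_bytes, conversion_factor=2):
    """Return 1 / p_t, where p_t = memory / (conversion_factor * file_size).

    This is the number of partitions, i.e. the base number of samples
    when one record is sampled per partition.
    """
    return math.ceil(conversion_factor * file_size_bytes / memory_bytes)


def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson distribution with rate lam."""
    return sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k + 1))


def min_sampling_rate(k, confidence):
    """Smallest integer rate for which P(X <= k) is at most 1 - confidence."""
    lam = 1
    while poisson_cdf(k, lam) > 1 - confidence:
        lam += 1
    return lam


# Values taken from the text: 1 TB file, 512 MB (about 5.0E8 bytes) per node,
# k = 10 samples per partition, 95% confidence.
base = base_sample_count(10 ** 12, 5 * 10 ** 8)
rate = min_sampling_rate(k=10, confidence=0.95)
total = rate * base

print(base, rate, total)
```

Computing the cumulative probability directly, instead of reading it from the table, makes it easy to try other values of k or other confidence levels.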