The following page has been changed by SriranjanManjunath:
http://wiki.apache.org/pig/PigSampler

------------------------------------------------------------------------------
  
  For a 1TB file running on nodes which have 512 MB of memory, assuming a 
conversion factor of 2, the number of base samples turns out to be 4000. 
  
+ To compute the base number of samples, we calculate the number of partitions 
of the file based on the amount of memory available and the file size. For a 1TB 
file running on nodes having 512 MB of memory each: 
+ 
+ Threshold partition ''p'',,t,, = 5.0E8 / (2 * 1.0E12) = 0.00025 
(512 MB is taken as 5.0E8 bytes and 1TB as 1.0E12 bytes; since Java strings are 
UTF-16, a conversion factor of 2 has been assumed)
+ 
+ Base number of samples = 1 / ''p'',,t,, = 4000
+ 
+ 
+ Assuming we should sample at least 1 record per partition, we end up with a 
base number of 4000 samples.
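
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the calculation, not code from Pig: the function name `base_samples` and its parameters are hypothetical, and the sizes are the rounded values used in the text (1.0E12 bytes for 1TB, 5.0E8 bytes for 512 MB).

```python
def base_samples(file_size_bytes, memory_bytes, conversion_factor=2):
    """Number of memory-sized partitions, i.e. base samples at 1 per partition.

    Hypothetical helper: conversion_factor=2 mirrors the UTF-16 assumption
    made in the text above.
    """
    # Estimated in-memory size of the file after the conversion factor.
    expanded_size = conversion_factor * file_size_bytes
    # Ceiling division, so a partial last partition still gets one sample.
    return -(-expanded_size // memory_bytes)


# Rounded values from the text: 1TB file, 512 MB of memory per node.
base = base_samples(file_size_bytes=10**12, memory_bytes=5 * 10**8)
print(base)  # 4000
```

Integer arithmetic is used here so that the result is exact; computing 1 / ''p'',,t,, in floating point gives the same value up to rounding.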
+ 
  === Estimating the number of samples ===
  The probability that a partition has at most k samples is given by the 
Poisson cumulative distribution function. Although the value of k needs to be 
determined experimentally, a guidance value of 10 is obtained from various 
sources. A table of cumulative probabilities for a selected range of the sample 
rate (lambda) and the number of samples per partition is available 
[http://www.micquality.com/reference_tables/poisson.htm here]. From the table, 
for 95% confidence and k (number of samples) set to 10, the sampling rate 
appears to be 17: at lambda = 17, the probability that a partition receives 10 
or fewer samples is about 5%. Multiplying this rate by the 4000 base samples, 
the number of samples that we need to obtain from the input is 
17 * 4000 = 68000.
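
The table lookup described above can also be reproduced numerically. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and searching over integer values of lambda is an assumption made here to match the granularity of a lookup table, not something the page prescribes.

```python
from math import exp, factorial


def poisson_cdf(k, lam):
    """P(X <= k) for a Poisson-distributed X with mean lam."""
    return exp(-lam) * sum(lam**i / factorial(i) for i in range(k + 1))


def min_sampling_rate(k, confidence):
    """Smallest integer lambda with P(X <= k) <= 1 - confidence.

    Assumption: only integer rates are tried, as in a lookup table.
    """
    lam = 1
    while poisson_cdf(k, lam) > 1.0 - confidence:
        lam += 1
    return lam


BASE_SAMPLES = 4000  # base number of samples computed in the text above

rate = min_sampling_rate(k=10, confidence=0.95)
total_samples = rate * BASE_SAMPLES
print(rate, total_samples)  # 17 68000
```

With k = 10, a rate of 16 still leaves roughly a 7.7% chance of a partition receiving 10 or fewer samples, which is why 17 is the first rate that meets the 95% target.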
  
