Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The "LoadStoreRedesignProposal" page has been changed by ThejasNair.
http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=32&rev2=33

--------------------------------------------------

  
  '''Problem 2''':
  !PoissonSampleLoader samples 17 tuples from every set of tuples that will fit 
into reducer memory (see PigSkewedJoinSpec). Let us call this number of tuples 
that fit into reducer memory X; i.e., we need to sample one tuple every X/17 
tuples.
- Earlier, the number of tuples to be sampled was calculated before the tuples 
were read, in !PoissonSampleLoader.computeSamples(..) . To get the number of 
samples to be sampled in a map, the formula used was = 
number-of-reducer-memories-needed * 17 / number-of-splits
+ Earlier, the number of tuples to be sampled was calculated before the tuples 
were read, in !PoissonSampleLoader.computeSamples(..). To get the number of 
samples to be taken in a map, the formula used was: 
number-of-reducer-memories-needed * 17 / number-of-splits <<BR>>
  Where -
- number-of-reducer-memories-needed = (total_file_size * 
disk_to_mem_factor)/available_reducer_heap_size
+ number-of-reducer-memories-needed = (total_file_size * 
disk_to_mem_factor)/available_reducer_heap_size<<BR>>
  disk_to_mem_factor has a default of 2.
  
  Then !PoissonSampleLoader would return sampled tuples by skipping 
split-size/num_samples bytes at a time.
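
As a rough sketch of the old up-front estimate (class and method names here are illustrative, not Pig's actual !PoissonSampleLoader code):

```java
public class OldSampleCount {
    // 17 samples are wanted per "reducer memory" worth of tuples.
    static final long SAMPLES_PER_REDUCER_MEMORY = 17;

    // Old scheme: number of samples per map, computed before any tuple is read.
    // number-of-reducer-memories-needed = (total_file_size * disk_to_mem_factor) / heap
    // samples-per-map = number-of-reducer-memories-needed * 17 / number-of-splits
    static long computeSamples(long totalFileSize, double diskToMemFactor,
                               long reducerHeapSize, long numSplits) {
        double reducerMemoriesNeeded =
                (totalFileSize * diskToMemFactor) / reducerHeapSize;
        return Math.max(1,
                (long) (reducerMemoriesNeeded * SAMPLES_PER_REDUCER_MEMORY / numSplits));
    }

    public static void main(String[] args) {
        // e.g. a 10 GB input, disk_to_mem_factor 2, 1 GB reducer heap, 100 splits:
        long n = computeSamples(10L << 30, 2.0, 1L << 30, 100);
        System.out.println(n); // prints 3 (20 reducer-memories * 17 / 100 splits, truncated)
    }
}
```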
  
- With new loader we have to skip some number of tuples instead of bytes. But 
we don't have an estimate of total number of tuples in the input.
+ With the new loader we have to skip a number of tuples instead of bytes, but 
we don't have an estimate of the total number of tuples in the input.<<BR>>
  One way to work around this would be to use the in-memory size of a tuple to 
estimate its on-disk size using the above disk_to_mem_factor; the number of 
tuples to be skipped would then be (split-size/avg_mem_size_of_tuple)/numSamples
  
  But the use of disk_to_mem_factor is very dubious: the real 
disk_to_mem_factor will vary based on the compression algorithm, data 
characteristics (sorting etc.), and encoding.
  
  '''Solution''':
- The goal is to sample one tuple every X/17 tuples. (X = number of tuples that 
fit in available reducer memory)
+ The goal is to sample one tuple every X/17 tuples (X = the number of tuples 
that fit in available reducer memory).<<BR>>
- To estimate X, we can use available_reducer_heap_size/average-tuple-mem-size
+ To estimate X, we can use 
available_reducer_heap_size/average-tuple-mem-size.<<BR>>
  Number of tuples skipped for every sampled tuple = 1/17 * 
(available_reducer_heap_size/average-tuple-mem-size)
  
  The average-tuple-mem-size and the 
number-of-tuples-to-be-skipped-per-sampled-tuple are recalculated after each 
new tuple is sampled.
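
The new per-tuple scheme can be sketched as follows (assumed names, not Pig's actual code): keep a running average of the in-memory size of sampled tuples, and after each sample recompute the skip interval as 1/17 * (available_reducer_heap_size / average-tuple-mem-size).

```java
public class PoissonSkipSketch {
    static final long SAMPLES_PER_REDUCER_MEMORY = 17;

    final long reducerHeapSize;   // available_reducer_heap_size, in bytes
    long sampledCount = 0;        // number of tuples sampled so far
    double avgTupleMemSize = 0;   // running average of sampled tuples' in-memory size
    long skipInterval = 0;        // tuples to skip before the next sample

    PoissonSkipSketch(long reducerHeapSize) {
        this.reducerHeapSize = reducerHeapSize;
    }

    // Called each time a tuple is sampled, with that tuple's estimated in-memory size.
    void updateOnSample(long tupleMemSize) {
        // incremental running average of in-memory tuple size
        avgTupleMemSize =
                (avgTupleMemSize * sampledCount + tupleMemSize) / (sampledCount + 1);
        sampledCount++;
        // tuples skipped per sampled tuple = 1/17 * (heap / avg-tuple-mem-size)
        skipInterval = (long)
                (reducerHeapSize / avgTupleMemSize / SAMPLES_PER_REDUCER_MEMORY);
    }

    public static void main(String[] args) {
        PoissonSkipSketch s = new PoissonSkipSketch(1L << 30); // 1 GB reducer heap
        s.updateOnSample(1024);  // first sampled tuple is 1 KB in memory
        System.out.println(s.skipInterval); // prints 61680 (2^30 / 1024 / 17, truncated)
    }
}
```

Because the average (and hence the interval) is refreshed after every sample, no up-front estimate of the total tuple count or a disk_to_mem_factor is needed.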
