Apache Wiki
Mon, 16 Nov 2009 10:37:02 -0800
Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The "LoadStoreRedesignProposal" page has been changed by ThejasNair.
http://wiki.apache.org/pig/LoadStoreRedesignProposal?action=diff&rev1=32&rev2=33

--------------------------------------------------

  '''Problem 2''': !PoissonSampleLoader samples 17 tuples from every set of tuples that will fit into reducer memory (see PigSkewedJoinSpec). Let us call the number of tuples that fit into reducer memory X. That is, we need to sample one tuple every X/17 tuples.

- Earlier, the number of tuples to be sampled was calculated before the tuples were read, in !PoissonSampleLoader.computeSamples(..). To get the number of tuples to be sampled in a map, the formula used was = number-of-reducer-memories-needed * 17 / number-of-splits
+ Earlier, the number of tuples to be sampled was calculated before the tuples were read, in !PoissonSampleLoader.computeSamples(..). To get the number of tuples to be sampled in a map, the formula used was = number-of-reducer-memories-needed * 17 / number-of-splits <<BR>>
  Where -
- number-of-reducer-memories-needed = (total_file_size * disk_to_mem_factor)/available_reducer_heap_size
+ number-of-reducer-memories-needed = (total_file_size * disk_to_mem_factor)/available_reducer_heap_size<<BR>>
  disk_to_mem_factor has a default of 2.

  Then !PoissonSampleLoader would return sampled tuples by skipping split-size/num_samples bytes at a time.

- With the new loader we have to skip some number of tuples instead of bytes. But we don't have an estimate of the total number of tuples in the input.
+ With the new loader we have to skip some number of tuples instead of bytes. But we don't have an estimate of the total number of tuples in the input.<<BR>>
  One way to work around this would be to use the size of a tuple in memory to estimate the size of a tuple on disk using the above disk_to_mem_factor; then the number of tuples to be skipped will be = (split-size/avg_mem_size_of_tuple)/numSamples

  But the use of disk_to_mem_factor is very dubious: the real disk_to_mem_factor will vary based on compression algorithm, data characteristics (sorting etc.), and encoding.

  '''Solution''':
- The goal is to sample one tuple every X/17 tuples. (X = number of tuples that fit in available reducer memory)
+ The goal is to sample one tuple every X/17 tuples. (X = number of tuples that fit in available reducer memory).<<BR>>
- To estimate X, we can use available_reducer_heap_size/average-tuple-mem-size
+ To estimate X, we can use available_reducer_heap_size/average-tuple-mem-size.<<BR>>
  Number of tuples skipped for every sampled tuple = 1/17 * (available_reducer_heap_size/average-tuple-mem-size)

  The average-tuple-mem-size and number-of-tuples-to-be-skipped-every-sampled-tuple are recalculated after a new tuple is sampled.
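
The proposed solution can be illustrated with a small Python sketch. This is not the actual !PoissonSampleLoader code (Pig itself is written in Java); the function names, the mem_size_of callback, and the choice to average only over the tuples sampled so far are assumptions made for illustration. Only the constant 17 and the formula 1/17 * (available_reducer_heap_size/average-tuple-mem-size) come from the text above.

```python
# Hypothetical sketch of the sampling rule described in the proposal.
SAMPLES_PER_REDUCER_MEMORY = 17  # from the text: 17 samples per reducer memory


def skip_interval(available_reducer_heap_size, avg_tuple_mem_size):
    """Tuples to skip per sampled tuple: 1/17 * (heap size / average tuple size)."""
    # X: estimated number of tuples that fit in reducer memory.
    tuples_that_fit = available_reducer_heap_size / max(avg_tuple_mem_size, 1)
    return int(tuples_that_fit / SAMPLES_PER_REDUCER_MEMORY)


def sample_tuples(tuples, available_reducer_heap_size, mem_size_of):
    """Yield sampled tuples, recomputing the skip interval after each sample.

    mem_size_of is a hypothetical callable returning a tuple's in-memory
    size in bytes. Assumption: the average is taken over sampled tuples only.
    """
    total_mem = 0
    num_sampled = 0
    to_skip = 0  # sample the first tuple to seed the average
    for t in tuples:
        if to_skip > 0:
            to_skip -= 1
            continue
        num_sampled += 1
        total_mem += mem_size_of(t)
        avg = total_mem / num_sampled
        # Recalculate after every newly sampled tuple, as the proposal states.
        to_skip = skip_interval(available_reducer_heap_size, avg)
        yield t
```

For example, with a heap of 1700 bytes and tuples of 10 bytes each, X is estimated as 170, so 10 tuples are skipped after each sample. No disk_to_mem_factor appears anywhere, which is the point of the change.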