[ https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772197#action_12772197 ]
Thejas M Nair commented on PIG-1062:
------------------------------------

Dmitriy,
I had overlooked the fact that the input file size is also being used to calculate the number of samples. Thanks for pointing it out.

I don't know of any problems with using counters directly, as long as the information is required only after the sampling phase (the first mapreduce job), i.e. it could be used in PartitionSkewedKey().

The logic in PoissonSampleLoader.computeSamples is as follows (a detailed explanation will be added soon to the sampler wiki page):

The goal is to sample all keys from the first input that will need to be partitioned across multiple reducers in the join phase. Let us assume X tuples fit into the available memory in a reducer. Let's say we want to get 10 samples in each set of X tuples, with 95% confidence. Using Poisson distribution formulas, we arrive at 17 as the number of tuples to be sampled every X tuples. (I don't know why the Poisson distribution is the appropriate choice.)

The total number of tuples to be sampled cannot be calculated without knowing the total number of tuples. But what we do know is that we should sample one tuple every (X/17) tuples.

To calculate X, we need the average size of a tuple in memory. Using the process memory usage is unlikely to give a good approximation of that, because (as per my understanding) calling the garbage collector is not guaranteed to free the memory used by all unused objects. Tuple.getMemorySize() can be used to get an estimate of the memory used by a tuple. The average size could be estimated and corrected as we sample more tuples.

I.e., PoissonSampleLoader.getNext() will return every (H/s)-th tuple in the input (using H and s from the previous comment).

In PartitionSkewedKey.exec(), Dmitriy's idea of using the number of samples and the sample rate (H/s) can be used to estimate the total number of tuples.

WeightedRangePartitioner.setConf is another function that uses fileSize(). That needs to change as well. I haven't looked at that yet.
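
As a rough illustration of the sampling rule described above, here is a minimal Python sketch. It is not Pig's Java code. The function names (sampling_interval, sample, estimate_total_tuples), the reducer_memory_bytes parameter, and the size_of callback (standing in for something like Tuple.getMemorySize()) are all hypothetical, and the constant 17 is simply taken from the comment as given rather than derived here.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")

# Number of tuples to sample per X tuples, as quoted in the comment.
SAMPLES_PER_WINDOW = 17


def sampling_interval(reducer_memory_bytes: float,
                      avg_tuple_bytes: float,
                      samples_per_window: int = SAMPLES_PER_WINDOW) -> int:
    """Return how many input tuples to advance per sample, i.e. X / 17."""
    # X: how many tuples are assumed to fit into reducer memory.
    tuples_in_memory = reducer_memory_bytes / avg_tuple_bytes
    # Never return 0, so that at least every tuple can be sampled.
    return max(1, int(tuples_in_memory // samples_per_window))


def sample(tuples: Iterable[T],
           reducer_memory_bytes: float,
           size_of: Callable[[T], float]) -> Iterator[T]:
    """Yield roughly one tuple every X/17 tuples.

    The average tuple size is a running estimate that is corrected
    as more tuples are read (an assumption about how the correction
    could work, not a description of Pig's implementation).
    """
    total_bytes = 0.0
    seen = 0
    since_last_sample = 0
    for t in tuples:
        seen += 1
        since_last_sample += 1
        total_bytes += size_of(t)
        # Guard against a zero size estimate.
        avg_bytes = max(total_bytes / seen, 1.0)
        if since_last_sample >= sampling_interval(reducer_memory_bytes, avg_bytes):
            since_last_sample = 0
            yield t


def estimate_total_tuples(num_samples: int, interval: int) -> int:
    """Estimate the input size from the sample count and the sample rate."""
    return num_samples * interval
```

For example, with a hypothetical 170,000 bytes of reducer memory and tuples of 100 bytes each, X is 1,700 and the interval is 100, so an input of 1,000 tuples yields 10 samples, and multiplying the sample count by the interval recovers 1,000. When tuple sizes vary, the interval drifts as the average is corrected, so the estimate is only approximate.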

> load-store-redesign branch: change SampleLoader and subclasses to work with new LoadFunc interface
> ---------------------------------------------------------------------------------------------------
>
> Key: PIG-1062
> URL: https://issues.apache.org/jira/browse/PIG-1062
> Project: Pig
> Issue Type: Sub-task
> Reporter: Thejas M Nair
> Assignee: Thejas M Nair
>
> This is part of the effort to implement new load store interfaces as laid out
> in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
> PigStorage and BinStorage are now working.
> SampleLoader and subclasses -RandomSampleLoader, PoissonSampleLoader need to
> be changed to work with new LoadFunc interface.
> Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
> PoissonSampleLoader is used by skew join.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.