Thejas M Nair commented on PIG-1062:

I had overlooked the fact that the input file size is also used to calculate 
the number of samples. Thanks for pointing it out.

I don't know whether there are any problems with using counters directly, as 
long as the information is required only after the sampling phase (the first 
mapreduce job), i.e. it could be used in PartitionSkewedKey().

The logic in PoissonSampleLoader.computeSamples is as follows (a detailed 
explanation will be added soon to the sampler wiki page). The goal is to 
sample all keys from the first input that will need to be partitioned across 
multiple reducers in the join phase. 
Let us assume X tuples fit into the available memory in a reducer. Let's say 
we want to get 10 samples in each set of X tuples, with 95% confidence. Using 
Poisson distribution formulas, we arrive at 17 as the number of tuples to be 
sampled every X tuples. (I don't know why the Poisson distribution is the 
appropriate choice.)
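
The comment does not spell out the formula behind 17. One plausible reading, 
sketched below, is to find the smallest integer Poisson mean for which the 
number of samples in a set of X tuples reaches a minimum count with the given 
confidence. The class and method names are hypothetical and are not part of 
Pig. Under this reading, requiring more than 10 samples (at least 11) gives 17, 
while requiring at least 10 gives 16; which convention the original 
computation used is not stated here.

```java
// Hypothetical helper for illustration only; not part of Pig.
final class PoissonSampleCount {

    // P(N <= k) for N ~ Poisson(mean), summed term by term.
    static double cdf(int k, double mean) {
        if (k < 0) {
            return 0.0;
        }
        double term = Math.exp(-mean); // P(N = 0)
        double sum = term;
        for (int i = 1; i <= k; i++) {
            term *= mean / i;          // P(N = i) derived from P(N = i - 1)
            sum += term;
        }
        return sum;
    }

    // Smallest integer mean such that P(N >= minSamples) >= confidence.
    // P(N >= minSamples) grows with the mean, so a linear search suffices.
    static int minMean(int minSamples, double confidence) {
        int mean = 1;
        while (1.0 - cdf(minSamples - 1, mean) < confidence) {
            mean++;
        }
        return mean;
    }

    public static void main(String[] args) {
        System.out.println(minMean(10, 0.95)); // at least 10 samples
        System.out.println(minMean(11, 0.95)); // more than 10 samples
    }
}
```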

The total number of tuples to be sampled cannot be calculated without knowing 
the total number of tuples. But what we do know is that we should sample one 
tuple every (X/17) tuples. To calculate X, we need the average size of a tuple 
in memory. Using the process memory usage is unlikely to give a good 
approximation of that, because (as per my understanding) calling the garbage 
collector is not guaranteed to free the memory used by all unused objects. 
Tuple.getMemorySize() can be used to get an estimate of the memory used by a 
tuple. The average size could be estimated and corrected as we sample more 
tuples.
That is, PoissonSampleLoader.getNext() will return every (H/s)-th tuple in the 
input (using H and s from the previous comment).
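
A minimal sketch of the interval calculation described above, assuming the 
available reducer memory is known in bytes and that each sampled tuple's size 
comes from something like Tuple.getMemorySize(). The class, its fields and 
its methods are hypothetical; this is not the PoissonSampleLoader code.

```java
// Hypothetical sketch for illustration only; not Pig's PoissonSampleLoader.
final class SampleIntervalSketch {
    private final long memoryBytes;  // assumed memory available for tuples in a reducer
    private final int samplesPerX;   // samples wanted per X tuples, e.g. 17
    private long sizedTuples = 0;    // number of tuples measured so far
    private long totalBytes = 0;     // sum of their estimated in-memory sizes

    SampleIntervalSketch(long memoryBytes, int samplesPerX) {
        this.memoryBytes = memoryBytes;
        this.samplesPerX = samplesPerX;
    }

    // Fold one measured tuple size into the running average.
    void observe(long tupleBytes) {
        sizedTuples++;
        totalBytes += tupleBytes;
    }

    // Current estimate of X / samplesPerX in tuples, never less than 1.
    long interval() {
        if (sizedTuples == 0 || totalBytes == 0) {
            return 1; // no usable size estimate yet: sample the next tuple
        }
        double avgBytes = (double) totalBytes / sizedTuples;
        long x = (long) (memoryBytes / avgBytes); // tuples that fit in memory
        return Math.max(1, x / samplesPerX);
    }
}
```

A loader built this way would skip interval() - 1 tuples between samples and 
call observe() on each tuple it returns, so the interval is corrected as more 
tuples are seen.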

In PartitionSkewedKey.exec(), Dmitriy's idea of using the number of samples 
and the sample rate (H/s) can be used to estimate the total number of tuples. 
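
As a rough sketch of that estimate, assuming a fixed sampling interval (one 
sample every `interval` tuples), each sample stands for about `interval` 
input tuples. The class and method names are hypothetical.

```java
// Hypothetical sketch for illustration only; not part of Pig.
final class TotalTupleEstimate {

    // Estimated total input tuples when one tuple in every `interval` was sampled.
    static long estimate(long numSamples, long interval) {
        return Math.multiplyExact(numSamples, interval); // throws on overflow
    }
}
```

If the interval is corrected while sampling, as suggested above, a single 
multiplication is only an approximation.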

WeightedRangePartitioner.setConf is another function using fileSize().  That 
needs to change as well. I haven't looked at that yet.

> load-store-redesign branch: change SampleLoader and subclasses to work with 
> new LoadFunc interface 
> ---------------------------------------------------------------------------------------------------
>                 Key: PIG-1062
>                 URL: https://issues.apache.org/jira/browse/PIG-1062
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
> This is part of the effort to implement new load store interfaces as laid out 
> in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
> PigStorage and BinStorage are now working.
> SampleLoader and its subclasses (RandomSampleLoader, PoissonSampleLoader) 
> need to be changed to work with the new LoadFunc interface.  
> Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
> PoissonSampleLoader is used by skew join. 

