[
https://issues.apache.org/jira/browse/PIG-1062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772197#action_12772197
]
Thejas M Nair commented on PIG-1062:
------------------------------------
Dmitriy,
I had overlooked the fact that the input file size is also used to calculate
the number of samples. Thanks for pointing it out.
I don't know whether there are any problems with using counters directly, as
long as the information is required only after the (first MapReduce) sampling
phase, i.e. it could be used in PartitionSkewedKey().
The logic in PoissonSampleLoader.computeSamples is as follows (a detailed
explanation will be added soon to the sampler wiki page). The goal is to
sample all keys from the first input that will need to be partitioned across
multiple reducers in the join phase.
Let us assume X tuples fit into the available memory in a reducer. Let's say
we want to get 10 samples in each set of X tuples, with 95% confidence. Using
Poisson distribution formulas, we arrive at 17 as the number of tuples to be
sampled every X tuples. (I don't know why the Poisson distribution is the
appropriate choice.)
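The comment does not give the formula behind 17, so the following is only a
sketch for exploring it. It assumes the number of samples N that land in a
block of X tuples is Poisson-distributed and searches for the smallest integer
mean with P(N >= k) >= confidence. The function names are hypothetical and are
not part of Pig.

```python
import math


def poisson_cdf(k, mean):
    """Return P(N <= k) for N ~ Poisson(mean)."""
    term = math.exp(-mean)  # P(N = 0)
    total = term
    for i in range(1, k + 1):
        term *= mean / i  # P(N = i) derived from P(N = i - 1)
        total += term
    return total


def smallest_mean(min_samples, confidence):
    """Smallest integer mean with P(N >= min_samples) >= confidence.

    Assumption: this is one plausible reading of "k samples with 95%
    confidence"; the original computation may have used another criterion.
    """
    mean = 1
    while 1.0 - poisson_cdf(min_samples - 1, mean) < confidence:
        mean += 1
    return mean


if __name__ == "__main__":
    for k in (10, 11):
        print(k, smallest_mean(k, 0.95))
```

Under this reading, requiring at least 10 samples gives a mean of 16, and
requiring at least 11 (more than 10) gives 17, so the exact criterion that
produced 17 may differ slightly from the one assumed here.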
The total number of tuples to be sampled cannot be calculated without knowing
the total number of tuples. But what we do know is that we should sample one
tuple every (X/17) tuples. To calculate X, we need the average size of a tuple
in memory. Using the process memory usage is unlikely to give a good
approximation of that, because (as per my understanding) calling the garbage
collector is not guaranteed to free the memory used by all unused objects.
Tuple.getMemorySize() can be used to get an estimate of the memory used by a
tuple. The average size could be estimated/corrected as we sample more tuples.
I.e., PoissonSampleLoader.getNext() will return every (H/s)-th tuple in the
input (using H and s from the previous comment).
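A minimal sketch of the interval sampling described above, written in Python
for brevity rather than as Pig code. It assumes a caller-supplied size_of
function standing in for Tuple.getMemorySize(), and a memory_bytes budget
standing in for the reducer memory; both names, and the function itself, are
hypothetical. The average tuple size is re-estimated after each sample, so
the interval is corrected as more tuples are seen.

```python
def sample_every_interval(tuples, memory_bytes, size_of, samples_per_x=17):
    """Keep roughly `samples_per_x` tuples out of every X input tuples.

    X is the number of tuples that fit in `memory_bytes`, estimated from
    the running average size of the tuples sampled so far.
    """
    sampled = []
    total_size = 0  # summed size of the sampled tuples
    skip = 0        # input tuples to skip before taking the next sample
    for t in tuples:
        if skip > 0:
            skip -= 1
            continue
        sampled.append(t)
        total_size += size_of(t)
        avg_size = total_size / len(sampled)
        x = memory_bytes / avg_size                # tuples that fit in memory
        interval = max(1, int(x / samples_per_x))  # sample one per interval
        skip = interval - 1
    return sampled


if __name__ == "__main__":
    # 1000 tuples of 10 bytes each, 1700 bytes of memory: X = 170, interval = 10.
    picked = sample_every_interval(range(1000), 1700, lambda t: 10)
    print(len(picked), picked[:3])
```

Because the interval depends only on the estimated tuple size and the memory
budget, the loader never needs the total input size.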
In PartitionSkewedKey.exec(), Dmitriy's idea of using the number of samples
and the sample rate (H/s) can be used to estimate the total number of tuples.
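As a rough illustration of that estimate (a sketch only; the function and
parameter names are hypothetical, and sampling_interval stands for the H/s
value above): if one tuple was kept out of every sampling_interval input
tuples, the total is approximately the product of the two numbers.

```python
def estimate_total_tuples(num_samples, sampling_interval):
    """Approximate input size when one tuple was kept per interval."""
    return num_samples * sampling_interval


if __name__ == "__main__":
    # 100 samples taken at one per 10 tuples suggests about 1000 input tuples.
    print(estimate_total_tuples(100, 10))
```

The estimate is only as good as the interval is stable; if the interval was
corrected during sampling, an average interval would be the natural input.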
WeightedRangePartitioner.setConf is another function using fileSize(). That
needs to change as well. I haven't looked at that yet.
> load-store-redesign branch: change SampleLoader and subclasses to work with
> new LoadFunc interface
> ---------------------------------------------------------------------------------------------------
>
> Key: PIG-1062
> URL: https://issues.apache.org/jira/browse/PIG-1062
> Project: Pig
> Issue Type: Sub-task
> Reporter: Thejas M Nair
> Assignee: Thejas M Nair
>
> This is part of the effort to implement new load store interfaces as laid out
> in http://wiki.apache.org/pig/LoadStoreRedesignProposal .
> PigStorage and BinStorage are now working.
> SampleLoader and its subclasses (RandomSampleLoader, PoissonSampleLoader)
> need to be changed to work with the new LoadFunc interface.
> Fixing SampleLoader and RandomSampleLoader will get order-by queries working.
> PoissonSampleLoader is used by skew join.