Amir Youssefi commented on PIG-460:
We needed to make some changes to the original plan above to make it work.
- Using Dynamic Sampler: The original RandomSampleLoader is a Loader, so it has
access to the length of the slice and to file skip functionality. The new
sampler has neither, so we had to use dynamic sampling. The idea is to drop
half of the samples every time we hit the max sample size and continue taking
more samples with a new interval that is double the previous interval.
- Using Operator instead of EvalFunc: The sampler doesn't need to send Samples
for each row of input. It can send null for most of those rows. Hence using an
Operator is far better than using an EvalFunc.
- Emission of Samples when necessary: We only need to send Samples once all
samples are taken (Samples are finalized). Unfortunately there is no way for an
Operator to tell when we hit the last row of input, so we cannot send Samples
just once at the end. To solve this issue we keep all samples in a Bag and send
this bag only when a new sample is added to it. Then we use a LAST() function
to keep only the last bag and discard the rest. Pig doesn't have a builtin LAST
function, so I just added it.
- Projection only on Sample rows (not all rows): There was a Projection
Operator in MRCompiler for this. We sample only a few rows and don't need to
project columns for all rows. I pushed this projection to the
POSample operator and projected required columns only on rows making it to
- Using a Pig property to determine the number of samples instead of a constant
(100): Added pig.max.sample.size.per.slice to pig.properties and the validator.
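
The dynamic sampling idea from the first item can be sketched roughly as
follows. This is a minimal Python illustration, not the code in the attached
patches: the class name DynamicSampler, the method offer() and the choice to
halve just before the buffer would overflow are assumptions made for the
example.

```python
class DynamicSampler:
    """Keeps at most max_samples rows from a stream of unknown length.

    Hypothetical sketch: names and details are illustrative only.
    """

    def __init__(self, max_samples):
        if max_samples < 2:
            raise ValueError("max_samples must be at least 2")
        self.max_samples = max_samples
        self.interval = 1      # take every interval-th row
        self.rows_seen = 0
        self.samples = []

    def offer(self, row):
        """Feed one row; return True if the row was added to the samples."""
        index = self.rows_seen
        self.rows_seen += 1
        if index % self.interval != 0:
            return False
        if len(self.samples) == self.max_samples:
            # Buffer is full: keep every other sample and double the interval.
            self.samples = self.samples[::2]
            self.interval *= 2
            if index % self.interval != 0:
                return False
        self.samples.append(row)
        return True


sampler = DynamicSampler(max_samples=4)
for row in range(100):
    sampler.offer(row)
```

Because the surviving samples are always those at multiples of the current
interval, the sampler never needs to know the input length in advance, which
is the constraint an operator has and a loader does not.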
From a performance point of view, we pay a penalty for the conversion of every
row into a Tuple (rows are converted to Tuples before going through an operator
in the pipeline) and for the lack of access to the byte-skip functionality of
the file. So there are pros and cons to applying the original idea of using 2
MR jobs instead of 3. After detailed discussions with Alan, it was decided to
go ahead with this, as other modifications (e.g. when GROUP BY comes into the
picture) require this drop in MR jobs from 3 to 2.
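
For the "emission of Samples" item above, the bag-plus-LAST approach might
look like the sketch below. It is only an illustration in Python: the function
names emit_sample_bags and last are hypothetical, and a fixed sampling interval
stands in for the real sampler to keep the example short.

```python
def emit_sample_bags(rows, interval):
    """Yield a snapshot of the sample bag each time a sample is added.

    Rows that are not sampled produce no output, which plays the role of the
    operator returning null for them. Hypothetical sketch, not Pig code.
    """
    bag = []
    for index, row in enumerate(rows):
        if index % interval == 0:
            bag.append(row)
            # Copy the bag so later additions do not alter earlier emissions.
            yield list(bag)


def last(bags):
    """Return the final item of an iterable, or None if it is empty."""
    result = None
    for bag in bags:
        result = bag
    return result


final_bag = last(emit_sample_bags(range(10), interval=4))
```

The operator never has to detect the end of input: every emitted bag is a
valid prefix of the final one, and keeping only the last emission yields the
finalized samples.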
> PERFORMANCE: Order by done in 3 MR jobs, could be done in 2
> Key: PIG-460
> URL: https://issues.apache.org/jira/browse/PIG-460
> Project: Pig
> Issue Type: Bug
> Affects Versions: types_branch
> Reporter: Alan Gates
> Assignee: Amir Youssefi
> Fix For: types_branch
> Attachments: sampler.patch, sampler2.patch
> Currently order by is done in three MR jobs:
> job 1: read data in whatever loader the user requests, store using BinStorage
> job 2: load using RandomSampleLoader, find quantiles
> job 3: load data again and sort
> It is done this way because RandomSampleLoader extends BinStorage, and so
> needs the data in that format to read it.
> If the logic in RandomSampleLoader was made into an operator instead of being
> in a loader then jobs 1 and 2 could be merged. On average job 1 takes about
> 15% of the time of an order by script.