Pradeep Kamath updated PIG-545:

    Attachment: PIG-545-v3.patch

Attached a revised version of the last patch with the following changes:
1) When PARALLEL is not specified, the code now consults jobClient to get
defaultReduces() and uses 0.9 times that value as the number of reducers (and
hence the number of quantiles).
2) There was a bug in the way order by * was handled in MRCompiler; this is
now fixed.
3) In WeightedRangePartitioner, the basic idea is to first set up the
quantiles array so that each entry is the last element of a quantile
(partition). The code then iterates over all the sample items; if an item
equals the quantile element of its partition, there is a good chance that it
repeats in the next quantile. The occurrences of such sample items in each
partition are recorded and used to decide which partition such an item in the
real data should go to: the occurrences in a partition divided by the total
occurrences of the element give the probability that the element should go to
that partition. To set this up, the earlier version of the patch compared a
sample item with the quantile element of the next partition instead of the
quantile element of the partition in which the sample item falls. Since the
quantile element is the last element of its partition, it is the one that
should be used in the comparison to decide whether the item is likely to
cross over into the next partition. This has been fixed.
4) The earlier patch did not handle the case where the number of samples is
less than the number of quantiles; this is handled now.
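
To make the idea in 3) concrete, here is a minimal standalone sketch. It is
not the code in the patch: the class name WeightedQuantileSketch, its method
names, and the use of int keys are all made up for illustration, and clamping
the number of partitions to the number of samples is only one possible way to
handle the case in 4).

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; names and int keys are hypothetical, not from the patch. */
final class WeightedQuantileSketch {

    /** Exclusive end index of partition p when n sorted samples are split into parts. */
    private static int endOf(int p, int n, int parts) {
        return (int) ((long) (p + 1) * n / parts);
    }

    /** Quantile element of each partition: the last sample that falls in it. */
    static int[] quantiles(int[] sortedSamples, int requestedPartitions) {
        int n = sortedSamples.length;
        // Assumption: with fewer samples than partitions, use one partition per sample.
        int parts = Math.min(requestedPartitions, n);
        if (parts <= 0) {
            return new int[0];
        }
        int[] q = new int[parts];
        for (int p = 0; p < parts; p++) {
            q[p] = sortedSamples[endOf(p, n, parts) - 1];
        }
        return q;
    }

    /**
     * For every key that spans more than one partition, returns the fraction
     * of its sample occurrences that fall in each partition.
     */
    static Map<Integer, double[]> weights(int[] sortedSamples, int requestedPartitions) {
        int[] q = quantiles(sortedSamples, requestedPartitions);
        int parts = q.length;
        int n = sortedSamples.length;
        Map<Integer, double[]> counts = new TreeMap<>();
        if (parts == 0) {
            return counts;
        }
        int p = 0;
        for (int j = 0; j < n; j++) {
            while (j >= endOf(p, n, parts)) {
                p++; // move to the partition that contains sample j
            }
            int key = sortedSamples[j];
            // Compare with the quantile element of the sample's OWN partition.
            // A key already being tracked has spilled over from an earlier one.
            if (key == q[p] || counts.containsKey(key)) {
                counts.computeIfAbsent(key, k -> new double[parts])[p]++;
            }
        }
        Map<Integer, double[]> result = new TreeMap<>();
        for (Map.Entry<Integer, double[]> e : counts.entrySet()) {
            double total = 0;
            int nonZero = 0;
            for (double c : e.getValue()) {
                total += c;
                if (c > 0) {
                    nonZero++;
                }
            }
            if (nonZero < 2) {
                continue; // key stays inside one partition, no weighting needed
            }
            double[] probs = new double[parts];
            for (int i = 0; i < parts; i++) {
                probs[i] = e.getValue()[i] / total;
            }
            result.put(e.getKey(), probs);
        }
        return result;
    }

    /** Picks a partition for a weighted key, given a uniform random u in [0, 1). */
    static int choosePartition(double[] probs, double u) {
        double cumulative = 0.0;
        int last = 0;
        for (int p = 0; p < probs.length; p++) {
            if (probs[p] == 0.0) {
                continue;
            }
            cumulative += probs[p];
            last = p;
            if (u < cumulative) {
                return p;
            }
        }
        return last; // guards against rounding when u is very close to 1.0
    }
}
```

For the sorted samples {1, 1, 1, 2} and two partitions, the quantile elements
are 1 and 2, and key 1 gets the weights 2/3 and 1/3 because two of its three
occurrences fall in the first partition. Key 2 never leaves the last
partition, so it gets no weights and would be routed by plain range
comparison.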

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Pradeep Kamath
>             Fix For: types_branch
>         Attachments: PIG-545-v3.patch, WRP.patch, WRP1.patch
> In running tests on actual data, I've noticed that the final reduce of an 
> order by has skewed partitions.  Some reduces finish in a few seconds while 
> some run for 20 minutes.  Getting a better distribution should lead to much 
> better performance for order by.

