Shravan Matthur Narayanamurthy updated PIG-545:

    Attachment: WRP.patch

This patch implements the Weighted Range Partitioner described in DeWitt 
et al., "Practical Skew Handling in Parallel Joins". The JobControlCompiler 
has been modified to use the new partitioner for order by, so the existing 
unit tests should still be valid.
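
For readers unfamiliar with the technique, here is a minimal sketch of the 
weighted range partitioning idea. It is not the code in WRP.patch. It assumes 
integer keys and a sample that fits in memory, and the class and method names 
are hypothetical. A key that dominates the sample, and therefore straddles 
several quantile boundaries, is spread over those partitions in proportion to 
how many of its sample occurrences fell into each one. Every other key goes to 
a single partition, so the global order across partitions is preserved.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Illustrative only: hypothetical names, integer keys, in-memory sample.
class WeightedRangePartitionerSketch {
    // For each distinct sampled key: how many of its sample occurrences
    // landed in each partition when the sorted sample is cut into quantiles.
    private final TreeMap<Integer, int[]> sampleCounts = new TreeMap<>();
    private final Random random;

    WeightedRangePartitionerSketch(int[] sample, int numPartitions, long seed) {
        this.random = new Random(seed);
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        for (int i = 0; i < sorted.length; i++) {
            // Position i of n sorted samples belongs to quantile i * P / n.
            int partition = (int) ((long) i * numPartitions / sorted.length);
            sampleCounts.computeIfAbsent(sorted[i], k -> new int[numPartitions])[partition]++;
        }
    }

    int getPartition(int key) {
        Map.Entry<Integer, int[]> floor = sampleCounts.floorEntry(key);
        if (floor == null) {
            return 0; // smaller than every sampled key (or empty sample)
        }
        int[] counts = floor.getValue();
        if (floor.getKey() != key) {
            // Unsampled key: follow the last partition of the nearest smaller
            // sampled key, which keeps partitions in key order.
            int last = 0;
            for (int p = 0; p < counts.length; p++) {
                if (counts[p] > 0) {
                    last = p;
                }
            }
            return last;
        }
        // Sampled key: pick a partition with probability proportional to the
        // number of its sample occurrences that fell into that partition.
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        int r = random.nextInt(total);
        for (int p = 0; p < counts.length; p++) {
            r -= counts[p];
            if (r < 0) {
                return p;
            }
        }
        throw new IllegalStateException("unreachable: counts sum to total");
    }

    public static void main(String[] args) {
        // Key 5 makes up most of the sample, so it spans all four quantiles.
        int[] sample = {1, 5, 5, 5, 5, 5, 5, 9};
        WeightedRangePartitionerSketch partitioner =
                new WeightedRangePartitionerSketch(sample, 4, 42L);
        System.out.println(partitioner.getPartition(1)); // 0
        System.out.println(partitioner.getPartition(9)); // 3
        System.out.println(partitioner.getPartition(5)); // varies across calls
    }
}
```

In this sketch, rows with the heavy key 5 are spread over partitions 0 to 3 
with weights 1:2:2:1, rather than all landing on one reducer as they would 
with a plain range partitioner.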

One caveat is that the number of reducers must be specified via the parallel 
keyword when doing an order by. Currently, if you don't specify it, there is 
just one partition by default, which messes up the distribution. We need to 
do something about this. Another issue is that when the Partitioner is 
configured, it reads the entire sample file from HDFS but does not report 
progress while doing so, as I could not think of a way to do that right now.

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Amir Youssefi
>             Fix For: types_branch
>         Attachments: WRP.patch
> In running tests on actual data, I've noticed that the final reduce of an 
> order by has skewed partitions.  Some reduces finish in a few seconds while 
> some run for 20 minutes.  Getting a better distribution should lead to much 
> better performance for order by.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
