Santhosh Srinivasan commented on PIG-545:

Two, just getting better sampling won't resolve the issue for order by queries 
that have one or a few keys with a very high number of values, such as in a 
zipf distribution. Unfortunately for us, zipf is a very common data 
distribution. In this case our partitioner may need to be able to detect and 
split large keys by round robining them to a group of reducers.

Better sampling alone will not resolve the issue for order by, but it will 
help produce more evenly sized partitions for the reducers. Since it is 
sampling and not a brute force count of the cardinality of each key, there 
will always be a non-zero probability of one reducer getting more data than 
the others. Better sampling will minimize such occurrences.
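
To make the limitation concrete, here is a minimal sketch (plain Python, not 
Pig code) of sample-based range partitioning. The function names and the 
quantile rule are assumptions made for illustration only; they are not taken 
from Pig's sampler.

```python
import bisect
import random


def sample_boundaries(keys, num_reducers, sample_size, seed=0):
    """Pick num_reducers - 1 split points from a random sample of the keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Evenly spaced quantiles of the sorted sample become the range boundaries.
    return [sample[(i * len(sample)) // num_reducers]
            for i in range(1, num_reducers)]


def partition(key, boundaries):
    """Plain range partitioning: each key maps to exactly one reducer."""
    return bisect.bisect_right(boundaries, key)


def partition_sizes(keys, boundaries):
    """Count how many records each reducer would receive."""
    sizes = [0] * (len(boundaries) + 1)
    for key in keys:
        sizes[partition(key, boundaries)] += 1
    return sizes


# Uniform keys: the sampled quantiles give evenly sized partitions.
uniform = list(range(1000))
uniform_sizes = partition_sizes(uniform, sample_boundaries(uniform, 4, 1000))

# Skewed keys: one key (7) holds 90% of the records, so whichever reducer
# owns that key receives at least 900 records regardless of the sample.
skewed = [7] * 900 + list(range(100))
skewed_sizes = partition_sizes(skewed, sample_boundaries(skewed, 4, 1000))
```

With uniform keys the partitions come out even. With the skewed input, all 
records for the heavy key land on a single reducer no matter how good the 
sample is, because a plain range partitioner never splits a key.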

Secondly, after sampling, we can ensure that reducers get the right partitions 
by using Hadoop's support for custom partition functions to route keys to 
reducers. I am not quite sure how Pig can propose a generic partition function 
to achieve this.
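
One possible shape for such a partition function, sketched in plain Python 
rather than against the Hadoop API: if a key is heavy enough to occupy 
several consecutive sample boundaries, spread its records round robin over 
the adjacent reducers that those boundaries separate. The class name and the 
rule for detecting a heavy key are assumptions for illustration, not an 
existing Pig or Hadoop component.

```python
import bisect


class RoundRobinRangePartitioner:
    """Range partitioner that splits boundary keys across adjacent reducers."""

    def __init__(self, boundaries):
        self.boundaries = boundaries  # sorted split points from the sampler
        self._next = {}               # per-key round-robin counters

    def get_partition(self, key):
        lo = bisect.bisect_left(self.boundaries, key)
        hi = bisect.bisect_right(self.boundaries, key)
        if lo == hi:
            # Key is not a boundary value: ordinary range partitioning.
            return lo
        # Key equals one or more boundaries, so reducers lo..hi may all hold
        # it without breaking the global order. Rotate through them.
        n = self._next.get(key, 0)
        self._next[key] = n + 1
        return lo + n % (hi - lo + 1)


# Key 7 holds 90% of the records, so it fills all three sampled boundaries.
keys = [7] * 900 + list(range(100))
partitioner = RoundRobinRangePartitioner([7, 7, 7])

reducers = [[] for _ in range(4)]
for key in keys:
    reducers[partitioner.get_partition(key)].append(key)

sizes = [len(r) for r in reducers]
# Each reducer sorts its own input; concatenating in reducer order must
# still give a globally sorted result.
merged = [key for r in reducers for key in sorted(r)]
```

Because the reducers sharing the heavy key are adjacent, and every smaller 
key goes to a lower-numbered reducer while every larger key goes to a 
higher-numbered one, concatenating the sorted reducer outputs still yields a 
total order. Whether this generalizes to arbitrary key types and comparators 
in Pig is the open question raised above.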

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Amir Youssefi
>             Fix For: types_branch
> In running tests on actual data, I've noticed that the final reduce of an 
> order by has skewed partitions.  Some reduces finish in a few seconds while 
> some run for 20 minutes.  Getting a better distribution should lead to much 
> better performance for order by.
