Alan Gates commented on PIG-545:

I ran the pigmix queries L9 (order by a single column) and L10 (order by 
multiple columns) and found some interesting results.

For L9, the total ordering job (job 3) took 587 seconds.  Min and max times 
for individual reducers were 92 and 589 seconds (I'm not sure how one reducer 
ran 2 seconds longer than the total job time, but all these numbers come from 
the Hadoop web UI).  Seven of the 40 reducers (including the 92-second one) 
received no records to sort.  The long-running 589-second reducer received one 
key, which had 2M values.

For L10, the total ordering job took 238 seconds.  Min and max times for 
individual reducers were 99 seconds (3 keys, 32K records) and 232 seconds (413K 
keys, 496K records).

From this I draw a couple of conclusions:

One, our order by partitioner could be better built.  There is no reason a 
reducer should ever receive 0 records.  And in a job with 3 uncorrelated keys 
we still see a > 10x disparity in data distribution (32K versus 496K records 
per reducer in L10).  The partitioner needs to do a better job of distributing 
keys evenly across reducers.
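
To make the first point concrete, here is a minimal sketch of range partitioning from evenly spaced quantiles of a key sample. It is not Pig's actual partitioner; the function names (quantile_cut_points, reducer_for) and the sample data are assumptions made up for illustration.

```python
import bisect


def quantile_cut_points(sample, num_reducers):
    """Pick num_reducers - 1 boundary keys at evenly spaced ranks of the sorted sample.

    Hypothetical helper: assumes the sample is small enough to sort in memory.
    """
    ordered = sorted(sample)
    n = len(ordered)
    return [ordered[(i * n) // num_reducers] for i in range(1, num_reducers)]


def reducer_for(key, cut_points):
    """Map a key to a reducer index; keys below the first cut point go to reducer 0."""
    return bisect.bisect_right(cut_points, key)


def records_per_reducer(keys, cut_points):
    """Count how many records each reducer would receive for the given keys."""
    counts = [0] * (len(cut_points) + 1)
    for key in keys:
        counts[reducer_for(key, cut_points)] += 1
    return counts
```

With distinct, well-sampled keys this gives each reducer roughly the same number of records. If one key dominates the sample, the same cut point is chosen several times and the reducers between the repeated boundaries receive nothing, which is one plausible way to end up with empty reducers like the seven seen in L9.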

Two, just getting better sampling won't resolve the issue for order by queries 
that have one or a few keys with a very high number of values, such as in a 
Zipf distribution.  Unfortunately for us, Zipf is a very common data 
distribution.  In this case our partitioner may need to be able to detect 
large keys and split them by round-robining their records across a group of 
reducers.
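
One way the round-robin idea could look, as a sketch only: treat a key as "large" when it appears more than once among the sampled cut points, and rotate its records over the reducers those repeated boundaries delimit. The class name SkewAwarePartitioner and the detection rule are assumptions for illustration, not an existing Pig API.

```python
import bisect


class SkewAwarePartitioner:
    """Range partitioner that spreads a heavy key over several adjacent reducers.

    Hypothetical sketch: cut_points must be sorted and may contain repeated
    values; a value repeated k times marks a key heavy enough to fill k reducers.
    """

    def __init__(self, cut_points):
        self.cut_points = list(cut_points)
        self._next_offset = {}  # heavy key -> next round-robin offset

    def reducer_for(self, key):
        lo = bisect.bisect_left(self.cut_points, key)
        hi = bisect.bisect_right(self.cut_points, key)
        span = hi - lo  # how many times the key appears as a cut point
        if span <= 1:
            # Ordinary key: plain range partitioning.
            return hi
        # Heavy key: rotate over reducers lo+1 .. hi, which would otherwise
        # be empty (lo+1 .. hi-1) or overloaded (hi).
        offset = self._next_offset.get(key, 0)
        self._next_offset[key] = (offset + 1) % span
        return lo + 1 + offset
```

Because the reducers sharing a heavy key are adjacent and hold nothing smaller than that key except on the boundaries, concatenating reducer outputs in index order still yields a total order. In a real job each map task would keep its own counter, so the rotation would be per mapper rather than global.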

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Amir Youssefi
>             Fix For: types_branch
> In running tests on actual data, I've noticed that the final reduce of an 
> order by has skewed partitions.  Some reduces finish in a few seconds while 
> some run for 20 minutes.  Getting a better distribution should lead to much 
> better performance for order by.
