Olga Natkovich commented on PIG-871:

Hi Ankur,

Thanks for logging the bug. Before we decide on the solution we need to run 
some tests that actually show that we do have a problem. We know we have an 
issue with JOIN with large keys that spill. We are almost done implementing 
a solution for that case. We have also addressed the case of large keys in 
ORDER BY in a similar way.

It would be very interesting to come up with queries that demonstrate this 
behavior for GROUP BY queries. If you have such examples, please post them here.

> Improve distribution of keys in reduce phase
> --------------------------------------------
>                 Key: PIG-871
>                 URL: https://issues.apache.org/jira/browse/PIG-871
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.3.0
>            Reporter: Ankur
> The default hashing scheme used to distribute keys in the reduce phase 
> sometimes distributes keys unevenly, leaving 5-10% of reducers overloaded 
> with data. This bottleneck makes Pig jobs really slow and gives users a 
> bad impression.
> While there is no bulletproof solution to the problem in general, the 
> hashing can certainly be improved for better distribution. The proposal here 
> is to evaluate and incorporate other hashing schemes that give high avalanche 
> and more even distribution. We can start by evaluating MurmurHash which is 
> Apache 2.0 licensed and freely available here - 
> http://www.getopt.org/murmur/MurmurHash.java
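
One way to run the kind of comparison discussed above is a small standalone harness that assigns a set of keys to reducers under two hash functions and reports how far the busiest reducer is from the average. The sketch below is only an illustration under assumptions: the class and method names are hypothetical, the synthetic keys are made up, the seed is arbitrary, and the hash is a compact MurmurHash2-style 32-bit function that has not been checked against the MurmurHash.java file linked above. It does not use any Pig or Hadoop classes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical harness: compares reducer load under String.hashCode()
// and under a MurmurHash2-style hash. Not Pig code.
final class ReducerSkewSketch {

    // Arbitrary seed chosen for this sketch (assumption).
    static final int SEED = 0x9747b28c;

    // MurmurHash2-style 32-bit hash over a byte array.
    static int murmur2(byte[] data, int seed) {
        final int m = 0x5bd1e995;
        final int r = 24;
        int len = data.length;
        int h = seed ^ len;
        int i = 0;
        // Mix four bytes at a time, read as a little-endian int.
        while (len - i >= 4) {
            int k = (data[i] & 0xff)
                  | ((data[i + 1] & 0xff) << 8)
                  | ((data[i + 2] & 0xff) << 16)
                  | ((data[i + 3] & 0xff) << 24);
            k *= m;
            k ^= k >>> r;
            k *= m;
            h *= m;
            h ^= k;
            i += 4;
        }
        // Mix the remaining 0-3 bytes; cases fall through on purpose.
        switch (len - i) {
            case 3: h ^= (data[i + 2] & 0xff) << 16;
            case 2: h ^= (data[i + 1] & 0xff) << 8;
            case 1: h ^= (data[i] & 0xff);
                    h *= m;
            default: break;
        }
        // Final avalanche step.
        h ^= h >>> 13;
        h *= m;
        h ^= h >>> 15;
        return h;
    }

    // Map a hash to a reducer index: clear the sign bit, then take the modulus.
    static int partition(int hash, int numReducers) {
        return (hash & Integer.MAX_VALUE) % numReducers;
    }

    // Count how many keys land on each reducer under the chosen hash.
    static int[] loads(List<String> keys, int numReducers, boolean useMurmur) {
        int[] counts = new int[numReducers];
        for (String key : keys) {
            int hash = useMurmur
                ? murmur2(key.getBytes(StandardCharsets.UTF_8), SEED)
                : key.hashCode();
            counts[partition(hash, numReducers)]++;
        }
        return counts;
    }

    // Skew = busiest reducer's load divided by the mean load (1.0 is perfectly even).
    static double skew(int[] loads) {
        int max = 0;
        long total = 0;
        for (int v : loads) {
            max = Math.max(max, v);
            total += v;
        }
        if (total == 0) {
            return 0.0;
        }
        return max / ((double) total / loads.length);
    }

    public static void main(String[] args) {
        // Synthetic keys, made up for this sketch; replace with real group keys.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 100000; i++) {
            keys.add("user_" + (i * 16));
        }
        int numReducers = 32;
        System.out.println("hashCode skew: " + skew(loads(keys, numReducers, false)));
        System.out.println("murmur2 skew:  " + skew(loads(keys, numReducers, true)));
    }
}
```

Feeding the harness the actual group keys from a slow job, rather than synthetic ones, would show whether the hash function is the cause of the imbalance or whether a few very frequent keys are, which no choice of hash can spread out.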

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
