[ 
https://issues.apache.org/jira/browse/CRUNCH-527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriel Reid resolved CRUNCH-527.
---------------------------------
    Resolution: Fixed

Pushed to master

> Improve distribution of keys when using default (hash-based) partitioning
> -------------------------------------------------------------------------
>
>                 Key: CRUNCH-527
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-527
>             Project: Crunch
>          Issue Type: Bug
>            Reporter: Gabriel Reid
>            Assignee: Gabriel Reid
>         Attachments: CRUNCH-527.patch
>
>
> The default partitioner used for MR-based pipelines bases itself on the hash 
> code of keys modulo the number of partitions, along the lines of 
> {code}int partition = key.hashCode() % numPartitions{code}
> This approach dependent on the _lower bits_ of the hash code being uniformly 
> distributed. If the lower bits of the key hash code is not uniformly 
> distributed, the key space will not be uniformly distributed over the 
> partitions.
> It can be surprisingly easy to get a very poor distribution. For example, if 
> the keys are integer values and are all divisible by 2, then only half of the 
> partitions will receive data (as the hash code of an integer is the integer 
> value itself).
> This can even be a problem in situations where you would really not expect 
> it. For example, taking the byte-array representation of longs for each 
> timestamp of each second over a period of 24 hours (at millisecond 
> granularity) and partitioning it over 50 partitions results in 34 of the 50 
> partitions not getting any data at all.
> The easiest way to resolve this is to have a custom HashPartitioner that 
> applies a supplementary hash function to the return value of the key's 
> hashCode method. This same approach is taken in java.util.HashMap for the 
> same reason.
> Note that this same approach was proposed in MAPREDUCE-4827, but wasn't 
> committed (mostly) because of backwards compatibility issues (some people may 
> have counted on certain records showing up in a given output file). Seeing as 
> Crunch is a higher abstraction above MR, I assume that we don't need to worry 
> about the backwards compatibility issue as much, but there may be other 
> opinions on this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to