Gabriel Reid created CRUNCH-527:
-----------------------------------

             Summary: Improve distribution of keys when using default 
(hash-based) partitioning
                 Key: CRUNCH-527
                 URL: https://issues.apache.org/jira/browse/CRUNCH-527
             Project: Crunch
          Issue Type: Bug
            Reporter: Gabriel Reid
            Assignee: Gabriel Reid


The default partitioner used for MR-based pipelines bases itself on the hash 
code of keys modulo the number of partitions, along the lines of 
{code}int partition = key.hashCode() % numPartitions{code}

This approach dependent on the _lower bits_ of the hash code being uniformly 
distributed. If the lower bits of the key hash code is not uniformly 
distributed, the key space will not be uniformly distributed over the 
partitions.

It can be surprisingly easy to get a very poor distribution. For example, if 
the keys are integer values and are all divisible by 2, then only half of the 
partitions will receive data (as the hash code of an integer is the integer 
value itself).

This can even be a problem in situations where you would really not expect it. 
For example, taking the byte-array representation of longs for each timestamp 
of each second over a period of 24 hours (at millisecond granularity) and 
partitioning it over 50 partitions results in 34 of the 50 partitions not 
getting any data at all.

The easiest way to resolve this is to have a custom HashPartitioner that 
applies a supplementary hash function to the return value of the key's hashCode 
method. This same approach is taken in java.util.HashMap for the same reason.

Note that this same approach was proposed in MAPREDUCE-4827, but wasn't 
committed (mostly) because of backwards compatibility issues (some people may 
have counted on certain records showing up in a given output file). Seeing as 
Crunch is a higher abstraction above MR, I assume that we don't need to worry 
about the backwards compatibility issue as much, but there may be other 
opinions on this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to