[
https://issues.apache.org/jira/browse/CRUNCH-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14558405#comment-14558405
]
Gabriel Reid commented on CRUNCH-527:
-------------------------------------
Cool, patch is on the way.
> Improve distribution of keys when using default (hash-based) partitioning
> -------------------------------------------------------------------------
>
> Key: CRUNCH-527
> URL: https://issues.apache.org/jira/browse/CRUNCH-527
> Project: Crunch
> Issue Type: Bug
> Reporter: Gabriel Reid
> Assignee: Gabriel Reid
>
> The default partitioner used for MR-based pipelines bases itself on the hash
> code of keys modulo the number of partitions, along the lines of
> {code}int partition = key.hashCode() % numPartitions{code}
> This approach dependent on the _lower bits_ of the hash code being uniformly
> distributed. If the lower bits of the key hash code is not uniformly
> distributed, the key space will not be uniformly distributed over the
> partitions.
> It can be surprisingly easy to get a very poor distribution. For example, if
> the keys are integer values and are all divisible by 2, then only half of the
> partitions will receive data (as the hash code of an integer is the integer
> value itself).
> This can even be a problem in situations where you would really not expect
> it. For example, taking the byte-array representation of longs for each
> timestamp of each second over a period of 24 hours (at millisecond
> granularity) and partitioning it over 50 partitions results in 34 of the 50
> partitions not getting any data at all.
> The easiest way to resolve this is to have a custom HashPartitioner that
> applies a supplementary hash function to the return value of the key's
> hashCode method. This same approach is taken in java.util.HashMap for the
> same reason.
> Note that this same approach was proposed in MAPREDUCE-4827, but wasn't
> committed (mostly) because of backwards compatibility issues (some people may
> have counted on certain records showing up in a given output file). Seeing as
> Crunch is a higher abstraction above MR, I assume that we don't need to worry
> about the backwards compatibility issue as much, but there may be other
> opinions on this.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)