[ https://issues.apache.org/jira/browse/CRUNCH-527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559259#comment-14559259 ]
Josh Wills commented on CRUNCH-527: ----------------------------------- +1 > Improve distribution of keys when using default (hash-based) partitioning > ------------------------------------------------------------------------- > > Key: CRUNCH-527 > URL: https://issues.apache.org/jira/browse/CRUNCH-527 > Project: Crunch > Issue Type: Bug > Reporter: Gabriel Reid > Assignee: Gabriel Reid > Attachments: CRUNCH-527.patch > > > The default partitioner used for MR-based pipelines bases itself on the hash > code of keys modulo the number of partitions, along the lines of > {code}int partition = key.hashCode() % numPartitions{code} > This approach dependent on the _lower bits_ of the hash code being uniformly > distributed. If the lower bits of the key hash code is not uniformly > distributed, the key space will not be uniformly distributed over the > partitions. > It can be surprisingly easy to get a very poor distribution. For example, if > the keys are integer values and are all divisible by 2, then only half of the > partitions will receive data (as the hash code of an integer is the integer > value itself). > This can even be a problem in situations where you would really not expect > it. For example, taking the byte-array representation of longs for each > timestamp of each second over a period of 24 hours (at millisecond > granularity) and partitioning it over 50 partitions results in 34 of the 50 > partitions not getting any data at all. > The easiest way to resolve this is to have a custom HashPartitioner that > applies a supplementary hash function to the return value of the key's > hashCode method. This same approach is taken in java.util.HashMap for the > same reason. > Note that this same approach was proposed in MAPREDUCE-4827, but wasn't > committed (mostly) because of backwards compatibility issues (some people may > have counted on certain records showing up in a given output file). Seeing as > Crunch is a higher abstraction above MR, I assume that we don't need to worry > about the backwards compatibility issue as much, but there may be other > opinions on this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)