Gabriel Reid created KAFKA-2223:
-----------------------------------
Summary: Improve distribution of data when using hash-based
partitioning
Key: KAFKA-2223
URL: https://issues.apache.org/jira/browse/KAFKA-2223
Project: Kafka
Issue Type: Improvement
Reporter: Gabriel Reid
Both the DefaultPartitioner and the ByteArrayPartitioner derive the partition from
the hash code of the key modulo the number of partitions, along the lines of
{code}partition = key.hashCode() % numPartitions{code} (the conversion to an
absolute value is omitted here)
This approach depends entirely on the _lower bits_ of the hash code being
uniformly distributed in order to spread records evenly over multiple
partitions. If the lower bits of the key's hash code are not uniformly
distributed, the key space will not be uniformly distributed over the
partitions.
It can be surprisingly easy to get a very poor distribution. As a simple
example, if the keys are integer values that are all divisible by 2 and the
number of partitions is even, then only half of the partitions will receive
data (as the hash code of an Integer is the integer value itself).
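A minimal sketch of this effect (hypothetical class and partition count, not from the issue) using the bare hashCode-modulo scheme described above:

```java
import java.util.Set;
import java.util.TreeSet;

public class EvenKeyDemo {
    public static void main(String[] args) {
        int numPartitions = 8;
        Set<Integer> hit = new TreeSet<>();
        for (int key = 0; key < 1000; key += 2) {  // even keys only
            // Integer.hashCode() is the value itself, so even keys
            // can only ever land on even-numbered partitions.
            hit.add(Math.abs(Integer.valueOf(key).hashCode()) % numPartitions);
        }
        System.out.println(hit);  // [0, 2, 4, 6]
    }
}
```

Half of the eight partitions never see a record, exactly as described above.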
This can be a problem even in situations where you would really not expect it.
For example, taking the 8-byte big-endian byte-array representation of the
millisecond timestamp of each second over a 24-hour period and partitioning it
over 50 partitions results in 34 of the 50 partitions not getting any data at
all.
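The timestamp example can be reproduced along these lines. This is a sketch under the assumption that the byte-array key is hashed with java.util.Arrays.hashCode (the identity hashCode of a byte[] would be unusable for partitioning); the class name and partition count are illustrative:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TimestampSkewDemo {
    public static void main(String[] args) {
        int numPartitions = 50;
        Set<Integer> hit = new HashSet<>();
        for (long second = 0; second < 24 * 60 * 60; second++) {
            // 8-byte big-endian representation of the millisecond timestamp
            byte[] key = ByteBuffer.allocate(8).putLong(second * 1000L).array();
            hit.add(Math.abs(Arrays.hashCode(key)) % numPartitions);
        }
        System.out.println("partitions with data: " + hit.size()
                + " of " + numPartitions);
    }
}
```

The skew arises because successive millisecond timestamps only exercise a small cycle of low-byte values, and Arrays.hashCode does nothing to spread that structure across the modulus.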
The easiest way to resolve this is to have a custom HashPartitioner that
applies a supplementary hash function to the return value of the key's hashCode
method. This same approach is taken in java.util.HashMap for the exact same
reason.
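A sketch of what such a supplementary hash could look like. The mixer below is a MurmurHash3-style 32-bit finalizer, chosen here for illustration only (the issue does not prescribe a particular function); java.util.HashMap applies a similar bit-spreading step:

```java
import java.util.HashSet;
import java.util.Set;

public class SupplementalHashDemo {
    // MurmurHash3 fmix32 finalizer: avalanches every input bit
    // into every output bit before the modulus is taken.
    static int supplementalHash(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    static int partition(Object key, int numPartitions) {
        return (supplementalHash(key.hashCode()) & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The all-even-keys scenario from above now spreads out.
        Set<Integer> hit = new HashSet<>();
        for (int key = 0; key < 1000; key += 2) {
            hit.add(partition(key, 8));
        }
        System.out.println("partitions used: " + hit.size() + " of 8");
    }
}
```

Note that a simple XOR-shift spread alone would not help keys whose high bits are all zero (such as small integers), which is why a full mixing step is sketched here rather than HashMap's lighter spread.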
One potential issue with changing the default partitioner in this way is
backward compatibility: there may be existing logic that expects a given key to
be sent to a given partition.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)