David Buckley created KAFKA-14150:
-------------------------------------
Summary: Allocation of initial partitions is deterministic and
produces a leader bias when a broker is offline
Key: KAFKA-14150
URL: https://issues.apache.org/jira/browse/KAFKA-14150
Project: Kafka
Issue Type: Improvement
Reporter: David Buckley
Observation of our current cluster suggests that with N brokers, the first N
partitions are always allocated in a round-robin fashion with a random offset.
The preferred leader is always the first in a given replica list (and hence is
allocated round-robin, too). Subsequent partitions are allocated using some
shuffle of the broker list, again round-robin, which I think is fine and doesn't
show the bias I detail below. Suppose every topic has as many partitions as
there are brokers and a replication factor of 3. Then each partition's replica
list has the form {{k, k+1, k+2}}, except where this wraps. Example:
* Topic A: 3 partitions, replicas {{012}}, {{120}}, {{201}}, leaders 0, 1, 2
* Topic B: 3 partitions, replicas {{120}}, {{201}}, {{012}}, leaders 1, 2, 0
* Topic C: 3 partitions, replicas {{201}}, {{012}}, {{120}}, leaders 2, 0, 1
This means that if broker {{x}} goes down, every partition that had {{x}} as
its preferred leader now elects {{x+1}} as its leader. With broker 1 offline,
the replica and leader allocation looks like this:
* Topic A: 3 partitions, replicas {{02}}, {{20}}, {{20}}, leaders 0, 2, 2
* Topic B: 3 partitions, replicas {{20}}, {{20}}, {{02}}, leaders 2, 2, 0
* Topic C: 3 partitions, replicas {{20}}, {{02}}, {{20}}, leaders 2, 0, 2
We see that broker 2 becomes leader of 100% of the failed-over partitions, and
is now leader of 2x as many partitions as broker 0.
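The effect can be reproduced with a minimal sketch in Python. This is a simplified model of the round-robin allocation described above, not Kafka's actual assignment code; the helper names {{assign}} and {{leaders}} are hypothetical, and the leader is assumed to be the first replica in the list that is still online.

```python
from collections import Counter

def assign(num_brokers, num_partitions, rf, offset):
    """Simplified round-robin model (an assumption, not Kafka's code):
    partition p gets replicas offset+p, offset+p+1, ... modulo num_brokers."""
    return [[(offset + p + i) % num_brokers for i in range(rf)]
            for p in range(num_partitions)]

def leaders(assignment, offline):
    """Assume the leader is the first replica in the list that is online."""
    result = []
    for replicas in assignment:
        live = [b for b in replicas if b not in offline]
        result.append(live[0] if live else None)
    return result

# Topics A, B, C from the example: 3 brokers, 3 partitions, RF 3, offsets 0, 1, 2.
topics = {name: assign(3, 3, 3, off) for name, off in [("A", 0), ("B", 1), ("C", 2)]}

healthy = Counter(l for a in topics.values() for l in leaders(a, set()))
degraded = Counter(l for a in topics.values() for l in leaders(a, {1}))

print(dict(healthy))   # {0: 3, 1: 3, 2: 3}
print(dict(degraded))  # {0: 3, 2: 6}
```

Under this model, all three partitions that preferred broker 1 fail over to broker 2, which ends up leading 6 partitions against 3 for broker 0.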
If there were 6 brokers and broker 1 went offline, replica sets {{012}},
{{123}} and {{501}} would shrink to {{02}}, {{23}} and {{50}}, and broker 4
would provide no redundancy for any partition replicated on broker 1, in
addition to broker 2 leading 2x as many partitions as any other broker.
Brokers 0 and 2 are now more critical than 3 and 5, which are in turn more
critical than broker 4.
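The 6-broker case can be checked the same way. The sketch below again assumes the simplified round-robin model (one topic, 6 partitions, replication factor 3, offset 0) and is not Kafka's actual assignment logic; the variable names are illustrative. It counts, for each surviving broker, how many partitions it leads and how many degraded replica sets it shares with the offline broker.

```python
from collections import Counter

NUM_BROKERS, RF, OFFLINE = 6, 3, 1

# Simplified model: partition p has replicas p, p+1, p+2 modulo NUM_BROKERS.
replica_sets = [[(p + i) % NUM_BROKERS for i in range(RF)]
                for p in range(NUM_BROKERS)]

leader_count = Counter()
# Start every surviving broker at 0 so brokers with no shared partitions show up.
shared_with_offline = Counter({b: 0 for b in range(NUM_BROKERS) if b != OFFLINE})

for replicas in replica_sets:
    live = [b for b in replicas if b != OFFLINE]
    leader_count[live[0]] += 1          # first online replica leads
    if OFFLINE in replicas:             # this partition lost a replica
        for b in live:
            shared_with_offline[b] += 1

print(dict(leader_count))         # {0: 1, 2: 2, 3: 1, 4: 1, 5: 1}
print(dict(shared_with_offline))  # {0: 2, 2: 2, 3: 1, 4: 0, 5: 1}
```

Brokers 0 and 2 each hold two of the degraded partitions, brokers 3 and 5 hold one each, and broker 4 holds none, which matches the ordering of criticality described above.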
I'm unsure whether there are any undesirable side effects of this, but my
expectation is that the behaviour isn't really intended, because subsequent
partitions don't just repeat the round-robin of the first batch. Should the allocation of the
initial partitions be completely random to avoid this bias, or is it
inconsequential?