Saravanan created KAFKA-20075:
---------------------------------
Summary: 'Preferred leader election' schedule change is causing side
effects
Key: KAFKA-20075
URL: https://issues.apache.org/jira/browse/KAFKA-20075
Project: Kafka
Issue Type: Task
Components: config, controller, group-coordinator
Affects Versions: 4.1.1, 4.1.0, 4.0.1
Reporter: Saravanan
After upgrading from Kafka 3.9.1 to 4.1.0 in KRaft mode, the behavior of
preferred leader election with auto.leader.rebalance.enable=true appears to
have changed. The effective semantics of
leader.imbalance.check.interval.seconds are different: in 3.9.1, preferred
leader election for imbalanced partitions consistently occurred ~300 seconds
after a broker failure/recovery, whereas in 4.1.0 it can occur at any time
between 0 and 300 seconds after a broker comes back. This earlier rebalance can
overlap with partition unloading from the old leader, causing prolonged
consumer impact.
*In Kafka 3.9.1 KRaft:*
When a broker goes down and later comes back up, preferred leader election for
affected partitions (e.g., __consumer_offsets) consistently happens about 5
minutes (300 seconds) after the broker's failure/recovery sequence.
From an operator's perspective, the effective behavior is:
_"Preferred leader election runs ~300s after the broker event."_
This aligns intuitively with leader.imbalance.check.interval.seconds=300, and
the interval appears tied to the time when the broker failure/imbalance started.
*In Kafka 4.1.0 KRaft:*
With the same configuration (auto.leader.rebalance.enable=true,
leader.imbalance.check.interval.seconds=300), preferred leader election is now
driven by the new periodic task scheduler in QuorumController (e.g.,
PeriodicTask("electPreferred", ...)), plus per‑broker imbalance logic.
In practice, this means:
- Preferred leader election can occur at any time between 0 and 300 seconds
after a broker comes back, depending on where the controller's periodic
schedule currently is.
- The timing is no longer intuitively "300 seconds after the broker event" but
"on the next periodic electPreferred tick," which is decoupled from broker
failure/recovery.
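The difference between the two timing models can be sketched with a small
illustration (this is not Kafka code; the helper names are hypothetical and the
real logic lives in QuorumController's periodic task scheduling):

```python
# Illustrative sketch of the two effective timing models (not Kafka code).

INTERVAL = 300  # leader.imbalance.check.interval.seconds

def delay_event_driven(broker_recovery_time: float) -> float:
    """3.9.1-style effective behavior: election fires a fixed ~INTERVAL
    after the broker failure/recovery event, regardless of when it happens."""
    return INTERVAL

def delay_periodic(broker_recovery_time: float, last_tick: float) -> float:
    """4.1.0-style behavior: election fires on the next periodic
    electPreferred tick, which is decoupled from the broker event."""
    elapsed_since_tick = (broker_recovery_time - last_tick) % INTERVAL
    return INTERVAL - elapsed_since_tick

# A broker recovering just before the next tick gets almost no settle time:
print(delay_periodic(broker_recovery_time=299.0, last_tick=0.0))  # 1.0
# A broker recovering just after a tick waits nearly the full interval:
print(delay_periodic(broker_recovery_time=301.0, last_tick=0.0))  # 299.0
```

The first case is the problematic one: the election can fire while the old
leader is still unloading partitions.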
This semantic change is important because of the interaction with partition
load/unload:
When a broker that was a preferred leader comes back:
- The previous leader may still be unloading partitions (closing
producers/consumers, flushing state, checkpoints, etc.).
- If preferred leader election fires early (close to the broker's return), the
new preferred leader may start loading those same partitions while the old
leader is still unloading them.
This overlapping unload/load window causes:
- Extended recovery times for __consumer_offsets and other system topics.
- Noticeable consumer-side delays and lag spikes.
- Infrequent but high-impact incidents in production.
Conceptually, the change in 4.x is an improvement (cleaner periodic task
infrastructure, explicit electPreferred task, per-broker imbalance threshold),
but it also effectively changes the semantics of
leader.imbalance.check.interval.seconds as understood by operators:
- Previously (3.9.1), operators could treat it as "roughly how long after a
broker event before preferred leader rebalance kicks in."
- Now (4.1.0+), it is "the frequency of a global periodic check," not aligned
to broker status changes, which leads to leader rebalances occurring much
earlier than expected relative to broker recovery.
*Impact*
- Overlapping partition unloading/loading between old and new preferred
leaders.
- Longer recovery and stabilization time for critical internal topics like
__consumer_offsets.
- Noticeable and sometimes severe delays for consumers during these rare but
critical windows.
- Operational confusion: existing tuning based on 3.9.1's behavior no longer
matches what's observed in 4.1.0.
*Clarifications / Requests*
1. Intended semantics of leader.imbalance.check.interval.seconds in 4.x
- In 3.9.1, preferred leader election effectively happened ~300s after broker
failure/recovery.
- In 4.1.0, with the periodic electPreferred task, it can happen anytime
between 0-300s after a broker comes back.
- Is this changed timing relative to broker events intentional?
2. Interaction with new imbalance logic
- How do leader.imbalance.per.broker.percentage and the new KRaft controller
logic influence when preferred leader election is triggered (beyond the
periodic task)?
- Are there now event-driven triggers that can cause earlier rebalancing than
the configured interval?
3. Operational guidance to avoid overlap/unload issues
- What is the recommended way in 4.1.0+ to avoid preferred leader election
overlapping with partition unloading on the old leader (and loading on the new
one) after broker recovery?
- Should operators tune leader.imbalance.per.broker.percentage,
leader.imbalance.check.interval.seconds, or use another mechanism to delay
automatic preferred leader rebalance after a broker comes back?
4. Documentation expectations for upgrades
- If the new behavior is expected, can the docs explicitly state that
leader.imbalance.check.interval.seconds is a periodic scheduler interval, not a
post-broker-event delay, and that actual rebalance relative to broker events
may occur anywhere between 0 and the configured interval?
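As context for item 3: one workaround we are considering (an assumption on our
side, not a documented recommendation) is disabling automatic rebalance and
triggering preferred leader election manually once the recovered broker has
finished loading its partitions:

```properties
# server.properties (sketch; verify against your deployment)
auto.leader.rebalance.enable=false
```

followed by a manual run of kafka-leader-election.sh with
--election-type preferred once the cluster has stabilized. Confirmation of
whether this is the intended operational pattern in 4.1.0+ would be helpful.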
--
This message was sent by Atlassian Jira
(v8.20.10#820010)