Saravanan created KAFKA-20075:
---------------------------------

             Summary: 'Preferred leader election' schedule change is causing 
unintended side effects
                 Key: KAFKA-20075
                 URL: https://issues.apache.org/jira/browse/KAFKA-20075
             Project: Kafka
          Issue Type: Task
          Components: config, controller, group-coordinator
    Affects Versions: 4.1.1, 4.1.0, 4.0.1
            Reporter: Saravanan


After upgrading from Kafka 3.9.1 to 4.1.0 in KRaft mode, the behavior of 
preferred leader election with auto.leader.rebalance.enable=true appears to 
have changed. The effective semantics of 
leader.imbalance.check.interval.seconds are different: in 3.9.1, preferred 
leader election for imbalanced partitions consistently occurred ~300 seconds 
after a broker failure/recovery, whereas in 4.1.0 it can occur at any time 
between 0 and 300 seconds after a broker comes back. This earlier rebalance can 
overlap with partition unloading from the old leader, causing prolonged 
consumer impact.
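
For reference, the broker configuration in play (the first two values are 
stated in this report; leader.imbalance.per.broker.percentage is shown at its 
Kafka default and is not overridden here):
{code}
# server.properties (relevant excerpt)
auto.leader.rebalance.enable=true
leader.imbalance.check.interval.seconds=300
# Default shown; governs the per-broker imbalance threshold that gates election
leader.imbalance.per.broker.percentage=10
{code}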

*In Kafka 3.9.1 KRaft:*
When a broker goes down and later comes back up, preferred leader election for 
affected partitions (e.g., __consumer_offsets) consistently happens about 5 
minutes (300 seconds) after the broker's failure/recovery sequence.


From an operator's perspective, the effective behavior is:
_"Preferred leader election runs ~300s after the broker event."_
This aligns intuitively with leader.imbalance.check.interval.seconds=300, and 
the interval appears tied to the time when the broker failure/imbalance started.

*In Kafka 4.1.0 KRaft:*
With the same configuration (auto.leader.rebalance.enable=true, 
leader.imbalance.check.interval.seconds=300), preferred leader election is now 
driven by the new periodic task scheduler in QuorumController (e.g., 
PeriodicTask("electPreferred", ...)), plus per‑broker imbalance logic.
In practice, this means:
* Preferred leader election can occur at any time between 0 and 300 seconds 
after a broker comes back, depending on where the controller's periodic 
schedule currently is.
* The timing is no longer intuitively "300 seconds after the broker event" but 
"on the next periodic electPreferred tick," which is decoupled from broker 
failure/recovery.
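
To illustrate the decoupling, here is a minimal, self-contained sketch of the 
fixed-cadence pattern described above (illustrative only; the class name, task 
name, and output are assumptions, not the actual QuorumController/PeriodicTask 
code):
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a fixed-cadence "electPreferred" check that ticks
// every leader.imbalance.check.interval.seconds, independent of broker events.
public class PeriodicElectionSketch {
    public static void main(String[] args) {
        final long intervalSeconds = 300; // leader.imbalance.check.interval.seconds

        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        // The tick fires on a fixed schedule. A broker that re-registers at an
        // arbitrary point inside the interval is only picked up on the NEXT
        // tick, i.e. anywhere from ~0 to intervalSeconds later; nothing here
        // resets or aligns the schedule to the broker event.
        scheduler.scheduleAtFixedRate(
                () -> System.out.println(
                        "electPreferred tick: check imbalance, elect preferred leaders"),
                intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }
}
{code}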

This semantic change is important because of the interaction with partition 
load/unload. When a broker that was a preferred leader comes back:
* The previous leader may still be unloading partitions (closing 
producers/consumers, flushing state, checkpoints, etc.).
* If preferred leader election fires early (close to the broker's return), the 
new preferred leader may start loading those same partitions while the old 
leader is still unloading them.
This overlapping unload/load window causes:
* Extended recovery times for __consumer_offsets and other system topics.
* Noticeable consumer-side delays and lag spikes.
* Infrequent but high-impact incidents in production.

Conceptually, the change in 4.x is an improvement (cleaner periodic task 
infrastructure, an explicit electPreferred task, a per-broker imbalance 
threshold), but it also effectively changes the semantics of 
leader.imbalance.check.interval.seconds as understood by operators:
* Previously (3.9.1), operators could treat it as "roughly how long after a 
broker event before preferred leader rebalance kicks in."
* Now (4.1.0+), it is "the frequency of a global periodic check," not aligned 
to broker status changes, which leads to leader rebalances occurring much 
earlier than expected relative to broker recovery.

*Impact*
* Overlapping partition unloading/loading between the old and new preferred 
leaders.
* Longer recovery and stabilization time for critical internal topics like 
__consumer_offsets.
* Noticeable and sometimes severe delays for consumers during these rare but 
critical windows.
* Operational confusion: existing tuning based on 3.9.1's behavior no longer 
matches what is observed in 4.1.0.

*Clarifications / Requests*
# Intended semantics of leader.imbalance.check.interval.seconds in 4.x
#* In 3.9.1, preferred leader election effectively happened ~300s after broker 
failure/recovery.
#* In 4.1.0, with the periodic electPreferred task, it can happen anytime 
between 0 and 300s after a broker comes back.
#* Is this changed timing relative to broker events intentional?
# Interaction with the new imbalance logic
#* How do leader.imbalance.per.broker.percentage and the new KRaft controller 
logic influence when preferred leader election is triggered (beyond the 
periodic task)?
#* Are there now event-driven triggers that can cause earlier rebalancing than 
the configured interval?
# Operational guidance to avoid overlap/unload issues
#* What is the recommended way in 4.1.0+ to avoid preferred leader election 
overlapping with partition unloading on the old leader (and loading on the new 
one) after broker recovery?
#* Should operators tune leader.imbalance.per.broker.percentage, 
leader.imbalance.check.interval.seconds, or use another mechanism to delay 
automatic preferred leader rebalance after a broker comes back? (One possible 
mitigation is sketched after this list.)
# Documentation expectations for upgrades
#* If the new behavior is expected, can the docs explicitly state that 
leader.imbalance.check.interval.seconds is a periodic scheduler interval, not 
a post-broker-event delay, and that the actual rebalance relative to a broker 
event may occur anywhere between 0 and the configured interval?
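
For context on the operational-guidance question above, one possible 
mitigation (an assumption on our side, not confirmed guidance) is to disable 
the automatic rebalance entirely and run preferred leader election manually 
once the recovered broker has finished loading, using the stock tooling:
{code}
# Broker side: turn off the periodic automatic rebalance
#   auto.leader.rebalance.enable=false
# Then, after the recovered broker has fully caught up, trigger preferred
# leader election explicitly (broker1:9092 is a placeholder):
bin/kafka-leader-election.sh \
  --bootstrap-server broker1:9092 \
  --election-type PREFERRED \
  --all-topic-partitions
{code}
This trades automatic convergence for operator control over when leadership 
moves, which avoids the unload/load overlap window described above.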