[
https://issues.apache.org/jira/browse/KAFKA-20075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079239#comment-18079239
]
Saravanan commented on KAFKA-20075:
-----------------------------------
[~showuon]
Logs from 3.9.1. You can see '{*}maybeBalancePartitionLeaders{*}' triggering
at 18:19 which is around 5 mins after broker 0 was unfenced at 18:14. There
was no subsequent 'maybeBalancePartitionLeaders' log after that as there were
no broker restarts.
{code:java}
[2026-05-07 18:14:59,160] INFO [QuorumController id=1000] The request from
broker 0 to unfence has been granted because it has caught up with the offset
of its register broker record 354812.
(org.apache.kafka.controller.BrokerHeartbeatManager)
[2026-05-07 18:14:59,161] INFO [QuorumController id=1000] handleBrokerUnfenced:
changing 3 partition(s) (org.apache.kafka.controller.ReplicationControlManager)
[2026-05-07 18:14:59,161] INFO [QuorumController id=1000] Replayed
BrokerRegistrationChangeRecord modifying the registration for broker 0:
BrokerRegistrationChangeRecord(brokerId=0, brokerEpoch=354812, fenced=-1,
inControlledShutdown=0, logDirs=[])
(org.apache.kafka.controller.ClusterControlManager)
[2026-05-07 18:19:38,258] INFO [QuorumController id=1000]
maybeBalancePartitionLeaders: generated 1000 records in 2951 microseconds.
Rescheduling immediately. (org.apache.kafka.controller.QuorumController)
[2026-05-07 18:19:38,264] INFO [QuorumController id=1000]
maybeElectUncleanLeaders: generated 0 records in 42 microseconds.
(org.apache.kafka.controller.QuorumController)
[2026-05-07 18:19:38,277] INFO [QuorumController id=1000]
maybeBalancePartitionLeaders: generated 1000 records in 3069 microseconds.
Rescheduling immediately. (org.apache.kafka.controller.QuorumController)
[2026-05-07 18:19:38,293] INFO [QuorumController id=1000]
maybeBalancePartitionLeaders: generated 466 records in 1511 microseconds.
(org.apache.kafka.controller.QuorumController)
[2026-05-07 18:23:03,141] INFO [SnapshotGenerator id=1000] Creating new KRaft
snapshot file snapshot 00000000000000363229-0000000003 because we have waited
at least 60 minute(s). (org.apache.kafka.image.publisher.SnapshotGenerator)
[2026-05-07 18:23:03,154] INFO [SnapshotEmitter id=1000] Successfully wrote
snapshot 00000000000000363229-0000000003
(org.apache.kafka.image.publisher.SnapshotEmitter) {code}
compare this with 4.1.1 log below. The fact that the logs being different
confirms the code change in the periodic scheduler. Here the corresponding log
'{*}Periodic task electPreferred{*}' was triggered (18:50:57) within 1:30 mins
after broker 0 was unfenced (18:50:28). Also, you can see periodic task
*electPreferred* triggering every 5 mins in a schedule even without broker
restart. This was not the case in 3.9.1.
btw, controller.log was also introduced after 3.9.1.
{code:java}
[2026-05-07 18:50:28,676] INFO [QuorumController id=1000] The request from
broker 0 to unfence has been granted because it has caught up with the offset
of its register broker record 32430622.
(org.apache.kafka.controller.BrokerHeartbeatManager)
[2026-05-07 18:50:28,676] INFO [QuorumController id=1000] Replayed
BrokerRegistrationChangeRecord modifying the registration for broker 0:
BrokerRegistrationChangeRecord(brokerId=0, brokerEpoch=32430622, fenced=-1,
inControlledShutdown=0, logDirs=[])
(org.apache.kafka.controller.ClusterControlManager)
[2026-05-07 18:50:28,267] INFO [QuorumController id=1002] Replayed
BrokerRegistrationChangeRecord modifying the registration for broker 0:
BrokerRegistrationChangeRecord(brokerId=0, brokerEpoch=32430622, fenced=-1,
inControlledShutdown=0, logDirs=[])
(org.apache.kafka.controller.ClusterControlManager)
[2026-05-07 18:50:28,711] INFO [QuorumController id=1001] Replayed
BrokerRegistrationChangeRecord modifying the registration for broker 0:
BrokerRegistrationChangeRecord(brokerId=0, brokerEpoch=32430622, fenced=-1,
inControlledShutdown=0, logDirs=[])
(org.apache.kafka.controller.ClusterControlManager)
[2026-05-07 18:50:57,729] INFO [QuorumController id=1000] In the last 60000 ms
period, 397 controller events were completed, which took an average of 11.04 ms
each. The slowest event was processBrokerHeartbeat(227066699), which took
112.58 ms. (org.apache.kafka.controller.EventPerformanceMonitor)
[2026-05-07 18:51:57,724] INFO [QuorumController id=1000] Periodic task
electPreferred generated 284 records in 2258 microseconds.
(org.apache.kafka.controller.PeriodicTaskControlManager)
[2026-05-07 18:51:57,728] INFO [QuorumController id=1000] Periodic task
electUnclean generated 0 records in 12 microseconds.
(org.apache.kafka.controller.PeriodicTaskControlManager)
[2026-05-07 18:51:57,730] INFO [QuorumController id=1000] In the last 60000 ms
period, 385 controller events were completed, which took an average of 9.62 ms
each. The slowest event was writeNoOpRecord(1063180442), which took 35.83 ms.
(org.apache.kafka.controller.EventPerformanceMonitor)
[2026-05-07 18:52:57,732] INFO [QuorumController id=1000] In the last 60000 ms
period, 387 controller events were completed, which took an average of 10.35 ms
each. The slowest event was writeNoOpRecord(762440656), which took 67.26 ms.
(org.apache.kafka.controller.EventPerformanceMonitor)
[2026-05-07 18:53:57,732] INFO [QuorumController id=1000] In the last 60000 ms
period, 384 controller events were completed, which took an average of 9.96 ms
each. The slowest event was writeNoOpRecord(26319476), which took 35.72 ms.
(org.apache.kafka.controller.EventPerformanceMonitor)
[2026-05-07 18:54:57,733] INFO [QuorumController id=1000] In the last 60000 ms
period, 384 controller events were completed, which took an average of 9.39 ms
each. The slowest event was writeNoOpRecord(1926412140), which took 52.72 ms.
(org.apache.kafka.controller.EventPerformanceMonitor)
[2026-05-07 18:55:57,733] INFO [QuorumController id=1000] In the last 60000 ms
period, 384 controller events were completed, which took an average of 9.62 ms
each. The slowest event was writeNoOpRecord(737652369), which took 38.10 ms.
(org.apache.kafka.controller.EventPerformanceMonitor)
[2026-05-07 18:56:57,725] INFO [QuorumController id=1000] Periodic task
electPreferred generated 0 records in 34 microseconds.
(org.apache.kafka.controller.PeriodicTaskControlManager)
[2026-05-07 18:56:57,728] INFO [QuorumController id=1000] Periodic task
electUnclean generated 0 records in 47 microseconds.
(org.apache.kafka.controller.PeriodicTaskControlManager) {code}
> Side effects of 'preferred leader election' schedule change
> -----------------------------------------------------------
>
> Key: KAFKA-20075
> URL: https://issues.apache.org/jira/browse/KAFKA-20075
> Project: Kafka
> Issue Type: Task
> Components: config, controller, group-coordinator
> Affects Versions: 4.0.1, 4.1.0, 4.1.1
> Reporter: Saravanan
> Assignee: Jonah Hooper
> Priority: Major
> Labels: kraft, leader-imbalance, preferred-leader-election
>
> After upgrading from Kafka 3.9.1 to 4.1.0 in KRaft mode, the behavior of
> preferred leader election with auto.leader.rebalance.enable=true appears to
> have changed. The effective semantics of
> leader.imbalance.check.interval.seconds are different: in 3.9.1, preferred
> leader election for imbalanced partitions consistently occurred ~300 seconds
> after a broker failure/recovery, whereas in 4.1.0 it can occur at any time
> between 0 and 300 seconds after a broker comes back. This earlier rebalance
> can overlap with partition unloading from the old leader, causing prolonged
> consumer impact.
> *In Kafka 3.9.1 KRaft:*
> When a broker goes down and later comes back up, preferred leader election
> for affected partitions (e.g., __consumer_offsets) consistently happens about
> 5 minutes (300 seconds) after the broker's failure/recovery sequence.
> From an operator's perspective, the effective behavior is:
> _"Preferred leader election runs ~300s after the broker event."_
> This aligns intuitively with leader.imbalance.check.interval.seconds=300, and
> the interval appears tied to the time when the broker failure/imbalance
> started.
> *In Kafka 4.1.0 KRaft:*
> With the same configuration (auto.leader.rebalance.enable=true,
> leader.imbalance.check.interval.seconds=300), preferred leader election is
> now driven by the new periodic task scheduler in QuorumController (e.g.,
> PeriodicTask("electPreferred", ...)), plus per‑broker imbalance logic.
> In practice, this means:
> Preferred leader election can occur at any time between 0 and 300 seconds
> after a broker comes back, depending on where the controller's periodic
> schedule currently is.
> The timing is no longer intuitively "300 seconds after the broker event" but
> "on the next periodic electPreferred tick," which is decoupled from broker
> failure/recovery.
> This semantic change is important because of the interaction with partition
> load/unload:
> When a broker that was a preferred leader comes back:
> The previous leader may still be unloading partitions (closing
> producers/consumers, flushing state, checkpoints, etc.).
> If preferred leader election fires early (close to the broker's return), the
> new preferred leader may start loading those same partitions while the old
> leader is still unloading them.
> This overlapping unload/load window causes:
> Extended recovery times for __consumer_offsets and other system topics.
> Noticeable consumer-side delays and lag spikes.
> Infrequent but high-impact incidents in production.
> Conceptually, the change in 4.x is an improvement (cleaner periodic task
> infrastructure, explicit electPreferred task, per-broker imbalance
> threshold), but it also effectively changes the semantics of
> leader.imbalance.check.interval.seconds as understood by operators:
> Previously (3.9.1), operators could treat it as "roughly how long after a
> broker event before preferred leader rebalance kicks in."
> Now (4.1.0+), it is "the frequency of a global periodic check," not aligned
> to broker status changes, which leads to leader rebalances occurring much
> earlier than expected relative to broker recovery.
> *Impact*
> Overlapping partition unloading/loading between old and new preferred leaders.
> Longer recovery and stabilization time for critical internal topics like
> __consumer_offsets.
> Noticeable and sometimes severe delays for consumers during these rare but
> critical windows.
> Operational confusion: existing tuning based on 3.9.1's behavior no longer
> match what's observed in 4.1.0.
> *Clarifications / Requests*
> Intended semantics of leader.imbalance.check.interval.seconds in 4.x
> In 3.9.1, preferred leader election effectively happened ~300s after broker
> failure/recovery.
> In 4.1.0, with the periodic electPreferred task, it can happen anytime
> between 0-300s after a broker comes back.
> Is this changed timing relative to broker events intentional?
> Interaction with new imbalance logic
> How do leader.imbalance.per.broker.percentage and the new KRaft controller
> logic influence when preferred leader election is triggered (beyond the
> periodic task)?
> Are there now event-driven triggers that can cause earlier rebalancing than
> the configured interval?
> Operational guidance to avoid overlap/unload issues
> What is the recommended way in 4.1.0+ to avoid preferred leader election
> overlapping with partition unloading on the old leader (and loading on the
> new one) after broker recovery?
> Should operators tune leader.imbalance.per.broker.percentage,
> leader.imbalance.check.interval.seconds, or use another mechanism to delay
> automatic preferred leader rebalance after a broker comes back?
> Documentation expectations for upgrades
> If the new behavior is expected, can the docs explicitly state that
> leader.imbalance.check.interval.seconds is a periodic scheduler interval, not
> a post-broker-event delay, and that actual rebalance relative to broker
> events may occur anywhere between 0 and the configured interval?
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)