[ 
https://issues.apache.org/jira/browse/KAFKA-20075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18079239#comment-18079239
 ] 

Saravanan commented on KAFKA-20075:
-----------------------------------

[~showuon]  

Logs from 3.9.1.  You can see  '{*}maybeBalancePartitionLeaders{*}' triggering 
at 18:19  which is around 5 mins after broker 0 was unfenced at 18:14.  There 
was no subsequent 'maybeBalancePartitionLeaders' log after that as there were 
no broker restarts.
{code:java}
[2026-05-07 18:14:59,160] INFO [QuorumController id=1000] The request from 
broker 0 to unfence has been granted because it has caught up with the offset 
of its register broker record 354812. 
(org.apache.kafka.controller.BrokerHeartbeatManager)
[2026-05-07 18:14:59,161] INFO [QuorumController id=1000] handleBrokerUnfenced: 
changing 3 partition(s) (org.apache.kafka.controller.ReplicationControlManager)
[2026-05-07 18:14:59,161] INFO [QuorumController id=1000] Replayed 
BrokerRegistrationChangeRecord modifying the registration for broker 0: 
BrokerRegistrationChangeRecord(brokerId=0, brokerEpoch=354812, fenced=-1, 
inControlledShutdown=0, logDirs=[]) 
(org.apache.kafka.controller.ClusterControlManager)
[2026-05-07 18:19:38,258] INFO [QuorumController id=1000] 
maybeBalancePartitionLeaders: generated 1000 records in 2951 microseconds. 
Rescheduling immediately. (org.apache.kafka.controller.QuorumController)
[2026-05-07 18:19:38,264] INFO [QuorumController id=1000] 
maybeElectUncleanLeaders: generated 0 records in 42 microseconds. 
(org.apache.kafka.controller.QuorumController)
[2026-05-07 18:19:38,277] INFO [QuorumController id=1000] 
maybeBalancePartitionLeaders: generated 1000 records in 3069 microseconds. 
Rescheduling immediately. (org.apache.kafka.controller.QuorumController)
[2026-05-07 18:19:38,293] INFO [QuorumController id=1000] 
maybeBalancePartitionLeaders: generated 466 records in 1511 microseconds. 
(org.apache.kafka.controller.QuorumController)
[2026-05-07 18:23:03,141] INFO [SnapshotGenerator id=1000] Creating new KRaft 
snapshot file snapshot 00000000000000363229-0000000003 because we have waited 
at least 60 minute(s). (org.apache.kafka.image.publisher.SnapshotGenerator)
[2026-05-07 18:23:03,154] INFO [SnapshotEmitter id=1000] Successfully wrote 
snapshot 00000000000000363229-0000000003 
(org.apache.kafka.image.publisher.SnapshotEmitter) {code}

compare this with 4.1.1 log below.  The fact that the logs being different 
confirms the code change in the periodic scheduler.  Here the corresponding log 
'{*}Periodic task electPreferred{*}' was triggered (18:50:57) within 1:30 mins 
after broker 0 was unfenced (18:50:28).  Also, you can see periodic task 
*electPreferred* triggering every 5 mins in a schedule even without broker 
restart.  This was not the case in 3.9.1.  

btw, controller.log was also introduced after 3.9.1.   


{code:java}
[2026-05-07 18:50:28,676] INFO [QuorumController id=1000] The request from 
broker 0 to unfence has been granted because it has caught up with the offset 
of its register broker record 32430622. 
(org.apache.kafka.controller.BrokerHeartbeatManager)
[2026-05-07 18:50:28,676] INFO [QuorumController id=1000] Replayed 
BrokerRegistrationChangeRecord modifying the registration for broker 0: 
BrokerRegistrationChangeRecord(brokerId=0, brokerEpoch=32430622, fenced=-1, 
inControlledShutdown=0, logDirs=[]) 
(org.apache.kafka.controller.ClusterControlManager)
[2026-05-07 18:50:28,267] INFO [QuorumController id=1002] Replayed 
BrokerRegistrationChangeRecord modifying the registration for broker 0: 
BrokerRegistrationChangeRecord(brokerId=0, brokerEpoch=32430622, fenced=-1, 
inControlledShutdown=0, logDirs=[]) 
(org.apache.kafka.controller.ClusterControlManager)
[2026-05-07 18:50:28,711] INFO [QuorumController id=1001] Replayed 
BrokerRegistrationChangeRecord modifying the registration for broker 0: 
BrokerRegistrationChangeRecord(brokerId=0, brokerEpoch=32430622, fenced=-1, 
inControlledShutdown=0, logDirs=[]) 
(org.apache.kafka.controller.ClusterControlManager)
[2026-05-07 18:50:57,729] INFO [QuorumController id=1000] In the last 60000 ms 
period, 397 controller events were completed, which took an average of 11.04 ms 
each. The slowest event was processBrokerHeartbeat(227066699), which took 
112.58 ms. (org.apache.kafka.controller.EventPerformanceMonitor)
[2026-05-07 18:51:57,724] INFO [QuorumController id=1000] Periodic task 
electPreferred generated 284 records in 2258 microseconds. 
(org.apache.kafka.controller.PeriodicTaskControlManager)
[2026-05-07 18:51:57,728] INFO [QuorumController id=1000] Periodic task 
electUnclean generated 0 records in 12 microseconds. 
(org.apache.kafka.controller.PeriodicTaskControlManager)
[2026-05-07 18:51:57,730] INFO [QuorumController id=1000] In the last 60000 ms 
period, 385 controller events were completed, which took an average of 9.62 ms 
each. The slowest event was writeNoOpRecord(1063180442), which took 35.83 ms. 
(org.apache.kafka.controller.EventPerformanceMonitor)
[2026-05-07 18:52:57,732] INFO [QuorumController id=1000] In the last 60000 ms 
period, 387 controller events were completed, which took an average of 10.35 ms 
each. The slowest event was writeNoOpRecord(762440656), which took 67.26 ms. 
(org.apache.kafka.controller.EventPerformanceMonitor)
[2026-05-07 18:53:57,732] INFO [QuorumController id=1000] In the last 60000 ms 
period, 384 controller events were completed, which took an average of 9.96 ms 
each. The slowest event was writeNoOpRecord(26319476), which took 35.72 ms. 
(org.apache.kafka.controller.EventPerformanceMonitor)
[2026-05-07 18:54:57,733] INFO [QuorumController id=1000] In the last 60000 ms 
period, 384 controller events were completed, which took an average of 9.39 ms 
each. The slowest event was writeNoOpRecord(1926412140), which took 52.72 ms. 
(org.apache.kafka.controller.EventPerformanceMonitor)
[2026-05-07 18:55:57,733] INFO [QuorumController id=1000] In the last 60000 ms 
period, 384 controller events were completed, which took an average of 9.62 ms 
each. The slowest event was writeNoOpRecord(737652369), which took 38.10 ms. 
(org.apache.kafka.controller.EventPerformanceMonitor)
[2026-05-07 18:56:57,725] INFO [QuorumController id=1000] Periodic task 
electPreferred generated 0 records in 34 microseconds. 
(org.apache.kafka.controller.PeriodicTaskControlManager)
[2026-05-07 18:56:57,728] INFO [QuorumController id=1000] Periodic task 
electUnclean generated 0 records in 47 microseconds. 
(org.apache.kafka.controller.PeriodicTaskControlManager) {code}

> Side effects of 'preferred leader election' schedule change
> -----------------------------------------------------------
>
>                 Key: KAFKA-20075
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20075
>             Project: Kafka
>          Issue Type: Task
>          Components: config, controller, group-coordinator
>    Affects Versions: 4.0.1, 4.1.0, 4.1.1
>            Reporter: Saravanan
>            Assignee: Jonah Hooper
>            Priority: Major
>              Labels: kraft, leader-imbalance, preferred-leader-election
>
> After upgrading from Kafka 3.9.1 to 4.1.0 in KRaft mode, the behavior of 
> preferred leader election with auto.leader.rebalance.enable=true appears to 
> have changed. The effective semantics of 
> leader.imbalance.check.interval.seconds are different: in 3.9.1, preferred 
> leader election for imbalanced partitions consistently occurred ~300 seconds 
> after a broker failure/recovery, whereas in 4.1.0 it can occur at any time 
> between 0 and 300 seconds after a broker comes back. This earlier rebalance 
> can overlap with partition unloading from the old leader, causing prolonged 
> consumer impact.
> *In Kafka 3.9.1 KRaft:*
> When a broker goes down and later comes back up, preferred leader election 
> for affected partitions (e.g., __consumer_offsets) consistently happens about 
> 5 minutes (300 seconds) after the broker's failure/recovery sequence.
> From an operator's perspective, the effective behavior is:
> _"Preferred leader election runs ~300s after the broker event."_
> This aligns intuitively with leader.imbalance.check.interval.seconds=300, and 
> the interval appears tied to the time when the broker failure/imbalance 
> started.
> *In Kafka 4.1.0 KRaft:*
> With the same configuration (auto.leader.rebalance.enable=true, 
> leader.imbalance.check.interval.seconds=300), preferred leader election is 
> now driven by the new periodic task scheduler in QuorumController (e.g., 
> PeriodicTask("electPreferred", ...)), plus per‑broker imbalance logic.
> In practice, this means:
> Preferred leader election can occur at any time between 0 and 300 seconds 
> after a broker comes back, depending on where the controller's periodic 
> schedule currently is.
> The timing is no longer intuitively "300 seconds after the broker event" but 
> "on the next periodic electPreferred tick," which is decoupled from broker 
> failure/recovery.
> This semantic change is important because of the interaction with partition 
> load/unload:
> When a broker that was a preferred leader comes back:
> The previous leader may still be unloading partitions (closing 
> producers/consumers, flushing state, checkpoints, etc.).
> If preferred leader election fires early (close to the broker's return), the 
> new preferred leader may start loading those same partitions while the old 
> leader is still unloading them.
> This overlapping unload/load window causes:
> Extended recovery times for __consumer_offsets and other system topics.
> Noticeable consumer-side delays and lag spikes.
> Infrequent but high-impact incidents in production.
> Conceptually, the change in 4.x is an improvement (cleaner periodic task 
> infrastructure, explicit electPreferred task, per-broker imbalance 
> threshold), but it also effectively changes the semantics of 
> leader.imbalance.check.interval.seconds as understood by operators:
> Previously (3.9.1), operators could treat it as "roughly how long after a 
> broker event before preferred leader rebalance kicks in."
> Now (4.1.0+), it is "the frequency of a global periodic check," not aligned 
> to broker status changes, which leads to leader rebalances occurring much 
> earlier than expected relative to broker recovery.
> *Impact*
> Overlapping partition unloading/loading between old and new preferred leaders.
> Longer recovery and stabilization time for critical internal topics like 
> __consumer_offsets.
> Noticeable and sometimes severe delays for consumers during these rare but 
> critical windows.
> Operational confusion: existing tuning based on 3.9.1's behavior no longer 
> match what's observed in 4.1.0.
> *Clarifications / Requests*
> Intended semantics of leader.imbalance.check.interval.seconds in 4.x
> In 3.9.1, preferred leader election effectively happened ~300s after broker 
> failure/recovery.
> In 4.1.0, with the periodic electPreferred task, it can happen anytime 
> between 0-300s after a broker comes back.
> Is this changed timing relative to broker events intentional?
> Interaction with new imbalance logic
> How do leader.imbalance.per.broker.percentage and the new KRaft controller 
> logic influence when preferred leader election is triggered (beyond the 
> periodic task)?
> Are there now event-driven triggers that can cause earlier rebalancing than 
> the configured interval?
> Operational guidance to avoid overlap/unload issues
> What is the recommended way in 4.1.0+ to avoid preferred leader election 
> overlapping with partition unloading on the old leader (and loading on the 
> new one) after broker recovery?
> Should operators tune leader.imbalance.per.broker.percentage, 
> leader.imbalance.check.interval.seconds, or use another mechanism to delay 
> automatic preferred leader rebalance after a broker comes back?
> Documentation expectations for upgrades
> If the new behavior is expected, can the docs explicitly state that 
> leader.imbalance.check.interval.seconds is a periodic scheduler interval, not 
> a post-broker-event delay, and that actual rebalance relative to broker 
> events may occur anywhere between 0 and the configured interval?
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to