[jira] [Comment Edited] (KAFKA-19148) Potential Unclean Leader Election in KRaft Despite unclean.leader.election.enable=false

Emanuel Mena (Jira) Sun, 20 Jul 2025 04:35:06 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-19148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18008212#comment-18008212
 ]


Emanuel Mena edited comment on KAFKA-19148 at 7/20/25 11:34 AM:
----------------------------------------------------------------

Hey [~azhar2407] and [~julianbergner], I noticed this issue happening in our 
clusters as well.
We are running Kafka 3.9.0 in KRaft mode, and this happens on any action we 
perform with CC (add broker, remove broker, or rebalance the cluster).
I tried debugging it, but couldn't pinpoint where the issue is.
Here is what I could gather:
{code:java}
2025-07-16 10:43:50,852 INFO [QuorumController id=100] Replayed partition 
assignment change PartitionChangeRecord(partitionId=9, 
topicId=ZanFWE05S6aTL-yqdhGu2Q, isr=null, leader=-2, replicas=[0, 5, 7, 12], 
removingReplicas=[12], addingReplicas=[0], leaderRecoveryState=-1, 
directories=[cWMT3MFTB0JnBFKnqfobDQ, ctS9lAE9khloi7vrRQjg9Q, 
6tr7kMA4lP--DRX0bGt_Zw, PqG7ctFYosiON13SP1RblA], eligibleLeaderReplicas=null, 
lastKnownElr=null) for topic leefa-audit 
(org.apache.kafka.controller.ReplicationControlManager) 
[quorum-controller-100-event-handler] {code}
The cluster tries to remove the partition's current leader (replica 12) without 
triggering a leader change (the isr is null for some reason).
After that, the UNCLEAN election is triggered:
{code:java}
2025-07-16 10:44:01,851 INFO [QuorumController id=100] UNCLEAN partition change 
for leefa-audit-9 with topic ID ZanFWE05S6aTL-yqdhGu2Q: replicas: [0, 5, 7, 12] 
-> [0, 5, 7], directories: [cWMT3MFTB0JnBFKnqfobDQ, ctS9lAE9khloi7vrRQjg9Q, 
6tr7kMA4lP--DRX0bGt_Zw, PqG7ctFYosiON13SP1RblA] -> [cWMT3MFTB0JnBFKnqfobDQ, 
ctS9lAE9khloi7vrRQjg9Q, 6tr7kMA4lP--DRX0bGt_Zw], isr: [5, 7, 12] -> [5, 7, 0], 
removingReplicas: [12] -> [], addingReplicas: [0] -> [], leader: 12 -> 0, 
leaderEpoch: 16 -> 17, partitionEpoch: 37 -> 38 
(org.apache.kafka.controller.ReplicationControlManager) 
[quorum-controller-100-event-handler] {code}
During the UNCLEAN election, it chooses the only replica that is not in sync 
(replica 0; the previous isr was [5, 7, 12]).
I wonder whether this involves the 
[KIP-966|https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas]
 boilerplate in the code, as it looks like the leader is being picked from the 
ELR instead of the ISR.
Hope this helps and we can find a fix soon, because we are currently holding 
off any other KRaft migration process and looking at reverting the upgrades we 
already did, since the clusters are stuck without the ability to grow or 
rebalance.
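
The observation above can be checked mechanically. Below is a minimal Python sketch, not Kafka code: it takes the UNCLEAN log entry quoted in this comment (re-joined onto one line) and tests whether the new leader was a member of the previous isr. The function name and the regular expressions are assumptions made for illustration, and they match only the log format shown here.

```python
import re

# UNCLEAN log entry from this comment, re-joined onto a single line.
LOG_LINE = (
    "2025-07-16 10:44:01,851 INFO [QuorumController id=100] UNCLEAN partition change "
    "for leefa-audit-9 with topic ID ZanFWE05S6aTL-yqdhGu2Q: replicas: [0, 5, 7, 12] "
    "-> [0, 5, 7], directories: [cWMT3MFTB0JnBFKnqfobDQ, ctS9lAE9khloi7vrRQjg9Q, "
    "6tr7kMA4lP--DRX0bGt_Zw, PqG7ctFYosiON13SP1RblA] -> [cWMT3MFTB0JnBFKnqfobDQ, "
    "ctS9lAE9khloi7vrRQjg9Q, 6tr7kMA4lP--DRX0bGt_Zw], isr: [5, 7, 12] -> [5, 7, 0], "
    "removingReplicas: [12] -> [], addingReplicas: [0] -> [], leader: 12 -> 0, "
    "leaderEpoch: 16 -> 17, partitionEpoch: 37 -> 38 "
    "(org.apache.kafka.controller.ReplicationControlManager) "
    "[quorum-controller-100-event-handler]"
)

# Hypothetical patterns for the "old -> new" fields of this log format.
ISR_RE = re.compile(r"isr: \[([^\]]*)\] -> \[([^\]]*)\]")
LEADER_RE = re.compile(r"leader: (-?\d+) -> (-?\d+)")


def parse_ids(text):
    """Turn '5, 7, 12' into [5, 7, 12]; an empty string gives []."""
    return [int(part) for part in text.split(",") if part.strip()]


def check_leader_change(line):
    """Report whether the new leader was in the ISR before the change."""
    isr_match = ISR_RE.search(line)
    leader_match = LEADER_RE.search(line)
    if isr_match is None or leader_match is None:
        raise ValueError("line does not contain isr and leader transitions")
    old_isr = parse_ids(isr_match.group(1))
    new_leader = int(leader_match.group(2))
    return {
        "old_isr": old_isr,
        "old_leader": int(leader_match.group(1)),
        "new_leader": new_leader,
        "leader_in_old_isr": new_leader in old_isr,
    }


result = check_leader_change(LOG_LINE)
print(result)
```

For the entry above, the new leader (0) is not in the previous isr [5, 7, 12], which is what one would expect to be rejected when unclean.leader.election.enable=false.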

 



> Potential Unclean Leader Election in KRaft Despite 
> unclean.leader.election.enable=false
> ---------------------------------------------------------------------------------------
>
>                 Key: KAFKA-19148
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19148
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 3.9.0, 4.0.0
>            Reporter: Julian Bergner
>            Assignee: Azhar Ahmed
>            Priority: Critical
>         Attachments: Readme_Kraft.md, Readme_Zookeeper.md, 
> docker-compose_kraft.yml, docker-compose_zookeeper.yml
>
>
> *Issue Summary:*
> We're observing unclean leader election even though 
> {{unclean.leader.election.enable=false}}.
> *Scenario:*
> During a partition reassignment, if we promote a non-ISR broker to leader and 
> simultaneously remove the current leader from the ISR, Kafka still elects a 
> new leader from outside the ISR. This behavior contradicts the expected 
> behavior when unclean leader election is explicitly disabled.
> *Details:*
>  * *Original ISR:* [1, 2]
>  * *New ISR after reassignment:* [3, 2]
> *Kafka Versions Tested:*
>  * Kafka 4.0.0 (KRaft mode)
>  * Kafka 3.9.0 (KRaft mode)
>  * Kafka 3.9.0 (ZooKeeper mode)
> *Observation:*
>  * The behaviour differs between the two modes.
>  * In KRaft, unclean leader election occurred, which should not happen with 
> the config set to {{false}}.
>  * In ZooKeeper, no unclean leader election occurred.
> *Attachments:*
> Docker Compose files and reproduction steps for both:
>  * Kafka 4.0.0 (KRaft)
>  * Kafka 3.9.0 (ZooKeeper)
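
To make the scenario in the description concrete, here is a minimal Python sketch that writes a reassignment plan in the JSON layout read by kafka-reassign-partitions.sh through its --reassignment-json-file option. It assumes the reassignment moves a single partition from replicas [1, 2] to [3, 2], mirroring the ISR change listed under Details. The topic name, partition number, and helper name are hypothetical, and the attached reproduction steps may differ.

```python
import json


def build_reassignment(topic, partition, replicas):
    """Build a reassignment plan for a single partition."""
    return {
        "version": 1,
        "partitions": [
            {
                "topic": topic,
                "partition": partition,
                "replicas": list(replicas),
            }
        ],
    }


# Hypothetical topic and partition; target replicas follow the description:
# broker 3 (not in the original ISR [1, 2]) replaces broker 1 and is listed first.
plan = build_reassignment("test-topic", 0, [3, 2])
print(json.dumps(plan, indent=2))
```

The printed JSON can be saved to a file and passed to the tool together with --execute. With unclean leader election disabled, broker 3 would be expected to become leader only after it has joined the ISR.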



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
