edoardocomar opened a new pull request #11476: URL: https://github.com/apache/kafka/pull/11476
Add a call to `onControllerFailover` to the code path where `elect` is called and the broker discovers that it has already been elected.

We found that restarting the ZK leader could occasionally trigger this code path, and prior to this change it would not start a controller failover. This left our Kafka cluster in a state where the `/controller` znode existed and named the broker that had "won" the controller election, but at runtime every broker had resigned from being the controller. Without a running controller, restarting brokers would typically cause partitions to become under-replicated, because the restarted brokers never received the UpdateMetadata or LeaderAndIsr requests required to correctly lead or follow any of their replicas.

Also add some info-level logging, and make the log messages that were helpful in tracking the controller failover more descriptive.

Proposed fix for https://issues.apache.org/jira/browse/KAFKA-13407

Co-authored-by: Tina Selenge <gantigmaa.selen...@uk.ibm.com>
Co-authored-by: Adrian Preston <prest...@uk.ibm.com>
Co-authored-by: Edoardo Comar <eco...@euk.ibm.com.com>

### Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
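
For reviewers who want the control flow at a glance, the election path described in this pull request can be modelled roughly as follows. This is a minimal, self-contained sketch and not the actual controller code: the class `ControllerElectionSketch`, the `NO_CONTROLLER` constant, the `registeredControllerId` parameter (standing in for the broker id read from the `/controller` znode) and the `hasStartedFailover` accessor are all hypothetical names invented for illustration. Only `elect` and `onControllerFailover` are names taken from the description above.

```java
/** Simplified, hypothetical model of the election path; not Kafka's real code. */
class ControllerElectionSketch {
    // Assumption: -1 stands for "no broker id is stored in the /controller znode".
    static final int NO_CONTROLLER = -1;

    private final int brokerId;
    private int activeControllerId = NO_CONTROLLER;
    private boolean failoverStarted = false;

    ControllerElectionSketch(int brokerId) {
        this.brokerId = brokerId;
    }

    /**
     * registeredControllerId models the broker id found in the /controller
     * znode, or NO_CONTROLLER if the znode does not exist.
     */
    void elect(int registeredControllerId) {
        activeControllerId = registeredControllerId;
        if (activeControllerId != NO_CONTROLLER) {
            if (activeControllerId == brokerId) {
                // The case this pull request targets: the znode already names
                // this broker, so it must still start the controller failover.
                onControllerFailover();
            }
            // Either way the election itself is over.
            return;
        }
        // No controller registered: model this broker winning the election.
        activeControllerId = brokerId;
        onControllerFailover();
    }

    private void onControllerFailover() {
        failoverStarted = true;
    }

    boolean hasStartedFailover() {
        return failoverStarted;
    }

    public static void main(String[] args) {
        ControllerElectionSketch alreadyElected = new ControllerElectionSketch(1);
        alreadyElected.elect(1); // znode already names broker 1
        System.out.println("broker 1 failover started: " + alreadyElected.hasStartedFailover());

        ControllerElectionSketch otherBroker = new ControllerElectionSketch(2);
        otherBroker.elect(1); // znode names a different broker
        System.out.println("broker 2 failover started: " + otherBroker.hasStartedFailover());
    }
}
```

In this model, removing the inner `if` reproduces the reported symptom: the znode names a controller, yet no broker ever runs the failover logic.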