[ https://issues.apache.org/jira/browse/KAFKA-15371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17868447#comment-17868447 ]
Rashmi commented on KAFKA-15371:
--------------------------------

Facing a similar error message as [~endoftime], but executing a different workflow: a 3-node cluster with an instance of Kafka in KRaft mode running on each node. We simulate an unexpected shutdown by powering off all VMs in the cluster, then bring them back by powering on all the VMs. We see the Kafka-1 and Kafka-2 pods (previously hot standbys before the shutdown) restart successfully, but the Kafka-0 pod goes into CrashLoopBackOff. Checking the logs of Kafka-0 we see:

{quote}
[2024-07-22 22:40:00,622] ERROR Encountered fatal fault: Error loading metadata log record from offset 8683181 (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
java.lang.IllegalStateException: Tried to change broker 1, but that broker was not registered.
at org.apache.kafka.image.ClusterDelta.getBrokerOrThrow(ClusterDelta.java:83)
at org.apache.kafka.image.ClusterDelta.replay(ClusterDelta.java:112)
at org.apache.kafka.image.MetadataDelta.replay(MetadataDelta.java:321)
at org.apache.kafka.image.MetadataDelta.replay(MetadataDelta.java:229)
at org.apache.kafka.image.loader.MetadataBatchLoader.replay(MetadataBatchLoader.java:257)
at org.apache.kafka.image.loader.MetadataBatchLoader.loadBatch(MetadataBatchLoader.java:135)
at org.apache.kafka.image.loader.MetadataLoader.lambda$handleCommit$1(MetadataLoader.java:361)
at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
at java.base/java.lang.Thread.run(Thread.java:829)
{quote}

Cleaning up the log files for the corresponding broker resolves the issue – it is then able to come up and join the quorum – but this is not an ideal or preferred solution (see the sketches at the end of this message). Why does this happen exactly? And how can the broker recover after an unexpected shutdown without a manual cleanup of its logs?

Kafka version: 3.6.0

Steps to reproduce:
1. Bring up Kafka (KRaft mode) on a 3-node cluster.
2. Power off the VMs.
3. Restart the VMs. Two Kafka pods will enter the Ready state; the last one goes into CrashLoopBackOff.

> MetadataShell is stuck when bootstrapping
> -----------------------------------------
>
>                 Key: KAFKA-15371
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15371
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.5.1
>            Reporter: Deng Ziming
>            Priority: Major
>         Attachments: image-2023-08-17-10-35-01-039.png,
> image-2023-08-17-10-35-36-067.png, image-2023-10-31-09-04-53-966.png,
> image-2023-10-31-09-11-53-118.png, image-2023-10-31-09-12-19-051.png,
> image-2023-10-31-09-15-34-821.png
>
>
> I downloaded the 3.5.1 package and started it up, then used the metadata shell to
> inspect the data:
>
> {code:java}
> // shell
> bin/kafka-metadata-shell.sh --snapshot /tmp/kraft-combined-logs/__cluster_metadata-0/00000000000000000000.log {code}
> The process then gets stuck at loading.
>
> !image-2023-08-17-10-35-36-067.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
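The record that the loader fails to replay here (likely a broker registration change or fence/unfence record for broker 1) can be inspected with kafka-dump-log.sh, which has a --cluster-metadata-decoder option for __cluster_metadata segments. A minimal sketch follows; the data directory and segment file name are assumptions and should be replaced with the actual log.dirs path and the segment that covers offset 8683181:

{code:bash}
# Decode the cluster metadata log and look at the records around the failing offset.
# /var/lib/kafka/data and the segment name below are placeholders for this sketch.
bin/kafka-dump-log.sh --cluster-metadata-decoder \
  --files /var/lib/kafka/data/__cluster_metadata-0/00000000000008683000.log

# The same decoder also works on the file from the original report, as a
# non-interactive alternative to the metadata shell that hangs while loading:
bin/kafka-dump-log.sh --cluster-metadata-decoder \
  --files /tmp/kraft-combined-logs/__cluster_metadata-0/00000000000000000000.log
{code}

Once the offending record is visible, it is easier to tell whether broker 1's earlier registration record was lost (for example truncated during the unclean shutdown) or whether replay starts from a state that never contained it.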
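For completeness, the cleanup described above amounts to discarding the affected node's local copy of the metadata log so that it is re-replicated from the current quorum leader on the next start. A hedged sketch, assuming a data directory of /var/lib/kafka/data on the pod's persistent volume; all paths are placeholders, and this mirrors the reported workaround rather than an officially documented recovery procedure:

{code:bash}
# Keep the crash-looping pod stopped while its volume is touched (e.g. scale the
# StatefulSet down or mount the PVC from a temporary debug pod).

# Move the metadata log aside rather than deleting it, so it can be restored if needed.
# meta.properties in the log dir root is untouched, so the node keeps its node/cluster id.
mv /var/lib/kafka/data/__cluster_metadata-0 /var/lib/kafka/data/__cluster_metadata-0.bak

# On the next start the node fetches the latest metadata snapshot and log from the
# quorum leader, which is why the broker is able to come up and rejoin afterwards.
{code}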