[ https://issues.apache.org/jira/browse/KAFKA-16563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luke Chen updated KAFKA-16563:
------------------------------
    Summary: migration to KRaft hanging after MigrationClientException  (was: migration to KRaft hanging after KeeperException)

> migration to KRaft hanging after MigrationClientException
> ---------------------------------------------------------
>
>                 Key: KAFKA-16563
>                 URL: https://issues.apache.org/jira/browse/KAFKA-16563
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 3.7.0
>            Reporter: Luke Chen
>            Assignee: Luke Chen
>            Priority: Major
>
> While running the ZK to KRaft migration, we encountered an issue where the
> migration hangs and the `ZkMigrationState` cannot move to the `MIGRATION`
> state. After investigation, the root cause is that the pollEvent does not
> retry on the retriable `MigrationClientException` (i.e. ZK client retriable
> errors) when it should. Because of this, the poll event stops polling, and
> the KRaftMigrationDriver cannot work as expected.
>
> {code:java}
> 2024-04-11 21:27:55,393 INFO [KRaftMigrationDriver id=5] Encountered ZooKeeper error during event PollEvent. Will retry. (org.apache.kafka.metadata.migration.KRaftMigrationDriver) [controller-5-migration-driver-event-handler]
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists for /migration
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:126)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>     at kafka.zookeeper.AsyncResponse.maybeThrow(ZooKeeperClient.scala:570)
>     at kafka.zk.KafkaZkClient.createInitialMigrationState(KafkaZkClient.scala:1701)
>     at kafka.zk.KafkaZkClient.getOrCreateMigrationState(KafkaZkClient.scala:1689)
>     at kafka.zk.ZkMigrationClient.$anonfun$getOrCreateMigrationRecoveryState$1(ZkMigrationClient.scala:109)
>     at kafka.zk.ZkMigrationClient.getOrCreateMigrationRecoveryState(ZkMigrationClient.scala:69)
>     at org.apache.kafka.metadata.migration.KRaftMigrationDriver.applyMigrationOperation(KRaftMigrationDriver.java:248)
>     at org.apache.kafka.metadata.migration.KRaftMigrationDriver.recoverMigrationStateFromZK(KRaftMigrationDriver.java:169)
>     at org.apache.kafka.metadata.migration.KRaftMigrationDriver.access$1900(KRaftMigrationDriver.java:62)
>     at org.apache.kafka.metadata.migration.KRaftMigrationDriver$PollEvent.run(KRaftMigrationDriver.java:794)
>     at org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:127)
>     at org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:210)
>     at org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:181)
>     at java.base/java.lang.Thread.run(Thread.java:840){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
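
The failure mode described in the issue (a self-rescheduling poll event that stops polling once a retriable error is not followed by a reschedule) can be illustrated with a minimal, self-contained sketch. This is not the KRaftMigrationDriver code or the actual fix: `PollRetrySketch`, `RetriableClientException`, `FlakyRecovery`, and `simulate` are hypothetical names introduced only for illustration, and a plain in-memory queue stands in for the real event queue.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class PollRetrySketch {
    /** Hypothetical stand-in for a retriable migration client error. */
    static class RetriableClientException extends RuntimeException {
        RetriableClientException(String message) {
            super(message);
        }
    }

    /** Hypothetical operation that fails a fixed number of times, then succeeds. */
    static class FlakyRecovery {
        private int remainingFailures;

        FlakyRecovery(int failures) {
            this.remainingFailures = failures;
        }

        void recover() {
            if (remainingFailures > 0) {
                remainingFailures--;
                throw new RetriableClientException("transient ZK error");
            }
        }
    }

    /**
     * Drains a single-threaded event queue seeded with one poll event.
     * Returns true if the recovery step eventually succeeded.
     */
    public static boolean simulate(boolean rescheduleOnRetriableError, int failures) {
        FlakyRecovery recovery = new FlakyRecovery(failures);
        Queue<Runnable> queue = new ArrayDeque<>();
        boolean[] recovered = {false};

        Runnable poll = new Runnable() {
            @Override
            public void run() {
                try {
                    recovery.recover();
                    recovered[0] = true;
                } catch (RetriableClientException e) {
                    if (rescheduleOnRetriableError) {
                        // Retry path: put the poll event back on the queue.
                        queue.add(this);
                    }
                    // Otherwise nothing is re-enqueued and polling stops for good.
                }
            }
        };

        queue.add(poll);
        Runnable next;
        while ((next = queue.poll()) != null) {
            next.run();
        }
        return recovered[0];
    }

    public static void main(String[] args) {
        System.out.println("with reschedule:    " + simulate(true, 2));
        System.out.println("without reschedule: " + simulate(false, 2));
    }
}
```

In this sketch the only difference between the two runs is whether the handler re-enqueues the poll event after a retriable error; without that step the queue drains and the state never advances, which mirrors the hang reported above.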