[ https://issues.apache.org/jira/browse/KAFKA-20515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080803#comment-18080803 ]
Luke Chen commented on KAFKA-20515:
-----------------------------------
[~alyssahuang], if users encounter this issue, is there a workaround
they can apply?
> ZK leader failover during ZK migration can block migration
> ----------------------------------------------------------
>
> Key: KAFKA-20515
> URL: https://issues.apache.org/jira/browse/KAFKA-20515
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 3.9.0
> Reporter: Alyssa Huang
> Priority: Major
>
> A similar issue was fixed in https://issues.apache.org/jira/browse/KAFKA-16171,
> and the symptoms are similar:
> {code:java}
> org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode =
> BadVersion for /migration
> ...
> java.lang.RuntimeException: Conditional update on KRaft Migration ZNode
> failed. Sent zkVersion = X. The failed write was:
> ZkMigrationLeadershipState{kraftControllerId=<>, kraftControllerEpoch=<>,
> kraftMetadataOffset=<>, kraftMetadataEpoch=<>, lastUpdatedTimeMs=-1,
> migrationZkVersion=X, controllerZkEpoch=-1, controllerZkVersion=-2}. This
> indicates that another KRaft controller is making writes to ZooKeeper.{code}
> But the underlying root cause is different.
> In KAFKA-16171, the trigger is the controller failing over during
> migration. In this issue, the trigger is the ZK leader failing over before
> sending an ACK back to the controller for a successful /migration node change:
> * The KRaft controller sends ZK the request "set /migration, expected current
> dataVersion = N".
> * The ZK leader receives this request, writes the new value to /migration, and
> sets dataVersion to `N + 1`. The ZK leader replicates this to the follower
> nodes but shuts down before sending the success reply back to the controller.
> * The controller retries the same request after reconnecting to ZK, but
> it still expects `N`, whereas ZK has already moved on to `N + 1`, so the
> conditional update fails with BadVersion.
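
The sequence above can be reproduced in isolation with a small sketch. This is only an illustration of the lost-ACK race, not Kafka or ZooKeeper code: `VersionedNode` and `LostAckDemo` are hypothetical names, and the in-memory node merely mimics a conditional write that is checked against a data version.

```java
// Hypothetical in-memory stand-in for a versioned znode such as /migration.
// It is NOT the ZooKeeper client API; it only mimics a version-checked write.
final class VersionedNode {
    private String data = "";
    private int dataVersion = 0;

    // Conditional write: applied only if expectedVersion matches the current version.
    // Returning false plays the role of a BadVersion error.
    synchronized boolean setData(String newData, int expectedVersion) {
        if (expectedVersion != dataVersion) {
            return false;
        }
        data = newData;
        dataVersion++;
        return true;
    }

    synchronized int version() {
        return dataVersion;
    }

    synchronized String data() {
        return data;
    }
}

final class LostAckDemo {
    // Simulates the race: the first write is applied, its reply is lost,
    // and the caller retries with the version it sent originally.
    // Returns whether the retry succeeded.
    static boolean retryAfterLostAck(VersionedNode node, String payload) {
        int expected = node.version();          // controller believes the version is N

        // The "leader" applies the write, so the version becomes N + 1.
        // The result is deliberately ignored to model the lost success reply.
        node.setData(payload, expected);

        // After "reconnecting", the controller repeats the same request with stale N.
        return node.setData(payload, expected);
    }

    public static void main(String[] args) {
        VersionedNode node = new VersionedNode();
        boolean retryOk = retryAfterLostAck(node, "migration-state");
        System.out.println("retry succeeded: " + retryOk
                + ", version: " + node.version()
                + ", data: " + node.data());
    }
}
```

In this sketch the retry is rejected even though the node already holds exactly the data the caller wanted to write, which matches the symptom described in this issue: the first write succeeded, but the caller cannot tell that from the version mismatch alone.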