Shubham Raj created KAFKA-20022:
-----------------------------------
Summary: Kafka Dual Write Mode Sync Failure
Key: KAFKA-20022
URL: https://issues.apache.org/jira/browse/KAFKA-20022
Project: Kafka
Issue Type: Bug
Components: controller
Affects Versions: 3.9.0
Reporter: Shubham Raj
Assignee: David Arthur
Hi,
We migrated our Kafka cluster (v3.9) to *dual write mode* three weeks ago as
part of the planned one-month transition away from ZooKeeper. Recently, the
controller sync between the ZooKeeper and KRaft metadata went out of alignment.
As a result, the cluster is no longer in dual write mode. As proposed in
[KAFKA-16171|https://issues.apache.org/jira/browse/KAFKA-16171], attempts to
restart the controllers did not restore sync, and the ZooKeeper metadata is now
lagging behind as per logs.
*Impact*
* Dual write mode is no longer active, increasing risk of metadata divergence.
* ZooKeeper metadata is stale compared to KRaft.
* Migration timeline is at risk.
*Repetitive Logs in leader controller:*
{code:java}
[2025-12-29 01:44:23,852] ERROR Encountered zk migration fault: Unhandled error
in SyncKRaftMetadataEvent (org.apache.kafka.server.fault.LoggingFaultHandler)
java.lang.RuntimeException: Check op on KRaft Migration ZNode failed. Sent
zkVersion = 5349155. This indicates that another KRaft controller is making
writes to ZooKeeper.
at
kafka.zk.KafkaZkClient.handleUnwrappedMigrationResult$1(KafkaZkClient.scala:2050)
at
kafka.zk.KafkaZkClient.unwrapMigrationResponse$1(KafkaZkClient.scala:2076)
at
kafka.zk.KafkaZkClient.$anonfun$retryMigrationRequestsUntilConnected$2(KafkaZkClient.scala:2101)
at
scala.collection.StrictOptimizedIterableOps.map(StrictOptimizedIterableOps.scala:100)
at
scala.collection.StrictOptimizedIterableOps.map$(StrictOptimizedIterableOps.scala:87)
at scala.collection.mutable.ArrayBuffer.map(ArrayBuffer.scala:42)
at
kafka.zk.KafkaZkClient.retryMigrationRequestsUntilConnected(KafkaZkClient.scala:2101)
at
kafka.zk.migration.ZkTopicMigrationClient.$anonfun$createTopic$1(ZkTopicMigrationClient.scala:137)
at
kafka.zk.migration.ZkTopicMigrationClient.createTopic(ZkTopicMigrationClient.scala:111)
at
org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.lambda$null$3(KRaftMigrationZkWriter.java:233)
at
org.apache.kafka.metadata.migration.KRaftMigrationDriver.applyMigrationOperation(KRaftMigrationDriver.java:246)
at
org.apache.kafka.metadata.migration.KRaftMigrationDriver.applyMigrationOperation(KRaftMigrationDriver.java:240)
at
org.apache.kafka.metadata.migration.KRaftMigrationDriver.access$300(KRaftMigrationDriver.java:63)
at
org.apache.kafka.metadata.migration.KRaftMigrationDriver$SyncKRaftMetadataEvent.lambda$run$0(KRaftMigrationDriver.java:844)
at
org.apache.kafka.metadata.migration.KRaftMigrationDriver.lambda$countingOperationConsumer$6(KRaftMigrationDriver.java:970)
at
org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.lambda$handleTopicsSnapshot$4(KRaftMigrationZkWriter.java:230)
at java.base/java.lang.Iterable.forEach(Iterable.java:75)
at
org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.handleTopicsSnapshot(KRaftMigrationZkWriter.java:228)
at
org.apache.kafka.metadata.migration.KRaftMigrationZkWriter.handleSnapshot(KRaftMigrationZkWriter.java:96)
at
org.apache.kafka.metadata.migration.KRaftMigrationDriver$SyncKRaftMetadataEvent.run(KRaftMigrationDriver.java:843)
at
org.apache.kafka.queue.KafkaEventQueue$EventContext.run(KafkaEventQueue.java:132)
at
org.apache.kafka.queue.KafkaEventQueue$EventHandler.handleEvents(KafkaEventQueue.java:215)
at
org.apache.kafka.queue.KafkaEventQueue$EventHandler.run(KafkaEventQueue.java:186)
at java.base/java.lang.Thread.run(Thread.java:840)
[2025-12-29 01:44:23,852] TRACE [KRaftMigrationDriver id=10002] Received
metadata delta, but the controller is not in dual-write mode. Ignoring this
metadata update. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
[2025-12-29 01:44:23,852] TRACE [KRaftMigrationDriver id=10002] Received
metadata delta, but the controller is not in dual-write mode. Ignoring this
metadata update. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
[2025-12-29 01:44:23,852] TRACE [KRaftMigrationDriver id=10002] Received
metadata delta, but the controller is not in dual-write mode. Ignoring this
metadata update. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
[2025-12-29 01:44:23,852] TRACE [KRaftMigrationDriver id=10002] Received
metadata delta, but the controller is not in dual-write mode. Ignoring this
metadata update. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
[2025-12-29 01:44:23,852] TRACE [KRaftMigrationDriver id=10002] Received
metadata delta, but the controller is not in dual-write mode. Ignoring this
metadata update. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
{code}
*cluster status*
{code:java}
ClusterId: QBC8K1kNS02Sl9930_QDAA
LeaderId: 10002
LeaderEpoch: 253
HighWatermark: 12515984
MaxFollowerLag: 0
MaxFollowerLagTimeMs: 111
CurrentVoters: [10001,10002,10003]
CurrentObservers:
[5,4,1104,6,3,1,1103,1112,1108,1107,1109,1101,1102,2,1106,1110,1111,1105]
{code}
*Migration data in zookeeper*
{code:java}
In [4]: zk_client.get('/kafka/qa/migration')
Out[4]:
(b'{"version":0,"kraft_metadata_offset":10840503,"kraft_controller_id":10002,"kraft_metadata_epoch":170,"kraft_controller_epoch":253}',
ZnodeStat(czxid=7129652820618, mzxid=7176892421062, ctime=1765176701970,
mtime=1766986727917, version=5349156, cversion=0, aversion=0, ephemeralOwner=0,
dataLength=130, numChildren=0, pzxid=7129652820618))
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)