[ https://issues.apache.org/jira/browse/KAFKA-17146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17869676#comment-17869676 ]
David Arthur edited comment on KAFKA-17146 at 7/30/24 3:30 PM:
---------------------------------------------------------------

[~saimon46] by chance, did you previously attempt a migration on this cluster and revert it? It looks like there is a mismatch between the state in the /migration ZNode and the controller.

When the controller starts up, it reads the latest state of /migration from ZK. It appears that the migration was already completed:

{code:java}
[2024-07-15 12:24:17,254] INFO [KRaftMigrationDriver id=3000] Recovering migration state from ZK. Transitioned migration state from ZkMigrationLeadershipState{kraftControllerId=-1, kraftControllerEpoch=-1, kraftMetadataOffset=-1, kraftMetadataEpoch=-1, lastUpdatedTimeMs=-1, migrationZkVersion=-2, controllerZkEpoch=-1, controllerZkVersion=-2} to ZkMigrationLeadershipState{kraftControllerId=2000, kraftControllerEpoch=2142, kraftMetadataOffset=534009, kraftMetadataEpoch=2142, lastUpdatedTimeMs=1721037470273, migrationZkVersion=491092, controllerZkEpoch=-1, controllerZkVersion=-2} (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
[2024-07-15 12:24:17,255] INFO [KRaftMigrationDriver id=3000] Initial migration of ZK metadata is done. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
{code}

However, when activating the controller, it sees that the local log is empty and so initializes itself in pre-migration mode:

{code:java}
[2024-07-15 12:24:20,085] WARN [QuorumController id=3000] Performing controller activation. The metadata log appears to be empty. Appending 2 bootstrap record(s) in metadata transaction at metadata.version 3.7-IV4 from bootstrap source 'the binary bootstrap metadata file: /data/kafka_controller/bootstrap.checkpoint'. Putting the controller into pre-migration mode. No metadata updates will be allowed until the ZK metadata has been migrated. (org.apache.kafka.controller.QuorumController)
{code}

Once KRaftMigrationDriver handles the leader event, it thinks the migration is already done based on the /migration state:

{code:java}
[2024-07-15 12:44:06,255] TRACE [KRaftMigrationDriver id=3000] Claiming ZK controller leadership. Transitioned migration state from ZkMigrationLeadershipState{kraftControllerId=3000, kraftControllerEpoch=2, kraftMetadataOffset=534009, kraftMetadataEpoch=2142, lastUpdatedTimeMs=1721037470273, migrationZkVersion=491092, controllerZkEpoch=-1, controllerZkVersion=-2} to ZkMigrationLeadershipState{kraftControllerId=3000, kraftControllerEpoch=2, kraftMetadataOffset=534009, kraftMetadataEpoch=2142, lastUpdatedTimeMs=1721037470273, migrationZkVersion=491092, controllerZkEpoch=21, controllerZkVersion=21} (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
...
[2024-07-15 12:44:06,276] INFO [KRaftMigrationDriver id=3000] 3000 transitioning from BECOME_CONTROLLER to SYNC_KRAFT_TO_ZK state (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
{code}

Transitioning directly from BECOME_CONTROLLER to SYNC_KRAFT_TO_ZK only happens when KRaftMigrationDriver sees that the migration is already complete. Can you try reverting the migration and attempting it again (see the sketch after this list)?

* Stop KRaft controllers
* Delete the /controller ZNode
* Delete the /migration ZNode
* Reformat the KRaft controllers
* Start KRaft controllers
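The ZNode deletes are normally done with zookeeper-shell.sh and the controller reformat with kafka-storage.sh format. For reference, below is a minimal sketch of the ZK cleanup using the plain ZooKeeper Java client: it dumps the /migration ZNode (to confirm it still claims a completed migration) and then deletes /controller and /migration. The connect string zk-1:2181 and the class name ZkMigrationCleanup are placeholders; this is illustrative only, not an official procedure, and should only be run while the KRaft controllers are stopped.

{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

import java.nio.charset.StandardCharsets;

public class ZkMigrationCleanup {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; point this at your ZooKeeper ensemble.
        String connect = args.length > 0 ? args[0] : "zk-1:2181";
        ZooKeeper zk = new ZooKeeper(connect, 30_000, event -> { });
        try {
            // Show what the KRaft controllers will read back on startup.
            printZNode(zk, "/migration");
            // Remove the stale controller and migration state (version -1 = any version).
            deleteIfExists(zk, "/controller");
            deleteIfExists(zk, "/migration");
        } finally {
            zk.close();
        }
    }

    private static void printZNode(ZooKeeper zk, String path) throws KeeperException, InterruptedException {
        Stat stat = zk.exists(path, false);
        if (stat == null) {
            System.out.println(path + " does not exist");
            return;
        }
        byte[] data = zk.getData(path, false, stat);
        System.out.println(path + " (zkVersion=" + stat.getVersion() + "): "
            + new String(data, StandardCharsets.UTF_8));
    }

    private static void deleteIfExists(ZooKeeper zk, String path) throws KeeperException, InterruptedException {
        if (zk.exists(path, false) != null) {
            zk.delete(path, -1);
            System.out.println("Deleted " + path);
        }
    }
}
{code}

Once the ZNodes are gone, reformatting each controller typically means wiping its metadata log directory and re-running kafka-storage.sh format with the same cluster ID as the ZK cluster, after which the migration should start from a clean state.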
> ZK to KRAFT migration stuck in pre-migration mode
> -------------------------------------------------
>
>                 Key: KAFKA-17146
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17146
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, kraft, migration
>    Affects Versions: 3.7.0, 3.7.1
>         Environment: Isolated virtual machines: 3 VMs with Kafka brokers + 3 VMs with ZooKeeper/KRaft controllers
>            Reporter: Simone Brundu
>            Priority: Blocker
>              Labels: kraft, migration, zookeeper
>
> I'm performing a migration from ZooKeeper to KRaft on a Kafka 3.7.1 cluster.
> (EDIT: the same issue happens on version 3.7.0)
>
> I'm using this configuration to enable SSL everywhere, with SCRAM authentication for brokers and PLAIN authentication for controllers:
> {code:java}
> listener.security.protocol.map=EXTERNAL_SASL:SASL_SSL,CONTROLLER:SASL_SSL
> inter.broker.listener.name=EXTERNAL_SASL
> sasl.enabled.mechanisms=SCRAM-SHA-512,PLAIN
> sasl.mechanism=SCRAM-SHA-512
> sasl.mechanism.controller.protocol=PLAIN
> sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512 {code}
> The cluster initially has 3 brokers and 3 ZooKeeper nodes; a separate cluster of 3 KRaft controllers is then configured and run in parallel, as per the documentation for the migration process.
> I started the migration with the 3 controllers enrolled over SASL_SSL with PLAIN authentication, and I already see a strange TRACE log:
> {code:java}
> TRACE [KRaftMigrationDriver id=3000] Received metadata delta, but the controller is not in dual-write mode. Ignoring the change to be replicated to Zookeeper (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> Later followed by this message, where KRaft is waiting for brokers to register:
> {code:java}
> INFO [KRaftMigrationDriver id=1000] No brokers are known to KRaft, waiting for brokers to register. (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> As soon as I start reconfiguring the brokers so that they connect to the new controllers, everything looks good on the KRaft controllers, with notifications that the brokers are correctly connecting and being enrolled:
> {code:java}
> INFO [QuorumController id=1000] Replayed initial RegisterBrokerRecord for broker 1: RegisterBrokerRecord(brokerId=1, isMigratingZkBroker=true, incarnationId=xxxxxx, brokerEpoch=2638, endPoints=[BrokerEndpoint(name='EXTERNAL_SASL', host='vmk-tdtkafka-01', port=9095, securityProtocol=3)], features=[BrokerFeature(name='metadata.version', minSupportedVersion=19, maxSupportedVersion=19)], rack='zur1', fenced=true, inControlledShutdown=false, logDirs=[xxxxxx]) (org.apache.kafka.controller.ClusterControlManager)
> [...]
> INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2, 3] to register with KRaft. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
> [...]
> INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2] to register with KRaft. (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> As soon as the first broker connects, we start getting these INFO logs related to the migration process on the controller:
> {code:java}
> INFO [QuorumController id=1000] Cannot run write operation maybeFenceReplicas in pre-migration mode. Returning NOT_CONTROLLER. (org.apache.kafka.controller.QuorumController)
> INFO [QuorumController id=1000] maybeFenceReplicas: event failed with NotControllerException in 355 microseconds. Exception message: The controller is in pre-migration mode. (org.apache.kafka.controller.QuorumController){code}
> as well as requests to auto-create topics that already exist, in a loop every 30 seconds, on the last restarted broker:
> {code:java}
> INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager)
> INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager)
> INFO Sent auto-creation request for Set(_schemas) to the active controller.
> (kafka.server.DefaultAutoTopicCreationManager) {code}
> As long as the active controller is still in the old cluster (on one of the Kafka brokers), everything runs fine. As soon as the last broker is restarted, things go off the rails. This last node never gets any partitions assigned and the cluster stays forever with under-replicated partitions. Below is the log from the last broker registration, which should start the migration, but the cluster stays forever in *SYNC_KRAFT_TO_ZK* state in *pre-migration* mode.
> {code:java}
> INFO [QuorumController id=1000] The request from broker 2 to unfence has been granted because it has caught up with the offset of its register broker record 4101
> [...]
> INFO [KRaftMigrationDriver id=1000] Ignoring image MetadataProvenance(lastContainedOffset=4127, lastContainedEpoch=5, lastContainedLogTimeMs=1721133091831) which does not contain a superset of the metadata in ZK. Staying in SYNC_KRAFT_TO_ZK until a newer image is loaded (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> The only way to recover the cluster is to revert everything: stop the clusters, remove /controller from ZooKeeper, and restore the ZooKeeper-only configuration on the brokers. A cleanup of the controllers is necessary too.
> The migration never starts and the controllers never understand that they have to migrate the data from ZooKeeper. On top of that, the new controller claims to be the CONTROLLER but refuses to act as one.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)