[ https://issues.apache.org/jira/browse/KAFKA-17146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17869676#comment-17869676 ]
David Arthur edited comment on KAFKA-17146 at 7/30/24 3:30 PM:
---------------------------------------------------------------

[~saimon46] by chance, did you previously attempt a migration on this cluster and revert it? It looks like there is a mismatch between the state in the /migration ZNode and the controller.

When the controller starts up, it reads the latest state of /migration from ZK. It appears that the migration was already completed:

{code:java}
[2024-07-15 12:24:17,254] INFO [KRaftMigrationDriver id=3000] Recovering migration state from ZK. Transitioned migration state from ZkMigrationLeadershipState{kraftControllerId=-1, kraftControllerEpoch=-1, kraftMetadataOffset=-1, kraftMetadataEpoch=-1, lastUpdatedTimeMs=-1, migrationZkVersion=-2, controllerZkEpoch=-1, controllerZkVersion=-2} to ZkMigrationLeadershipState{kraftControllerId=2000, kraftControllerEpoch=2142, kraftMetadataOffset=534009, kraftMetadataEpoch=2142, lastUpdatedTimeMs=1721037470273, migrationZkVersion=491092, controllerZkEpoch=-1, controllerZkVersion=-2} (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
[2024-07-15 12:24:17,255] INFO [KRaftMigrationDriver id=3000] Initial migration of ZK metadata is done. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
{code}

However, when activating the controller, it sees that the local log is empty and so initializes itself in pre-migration mode:

{code:java}
[2024-07-15 12:24:20,085] WARN [QuorumController id=3000] Performing controller activation. The metadata log appears to be empty. Appending 2 bootstrap record(s) in metadata transaction at metadata.version 3.7-IV4 from bootstrap source 'the binary bootstrap metadata file: /data/kafka_controller/bootstrap.checkpoint'. Putting the controller into pre-migration mode. No metadata updates will be allowed until the ZK metadata has been migrated. (org.apache.kafka.controller.QuorumController)
{code}

Once KRaftMigrationDriver handles the leader event, it thinks the migration is already done based on the /migration state:

{code:java}
[2024-07-15 12:44:06,255] TRACE [KRaftMigrationDriver id=3000] Claiming ZK controller leadership. Transitioned migration state from ZkMigrationLeadershipState{kraftControllerId=3000, kraftControllerEpoch=2, kraftMetadataOffset=534009, kraftMetadataEpoch=2142, lastUpdatedTimeMs=1721037470273, migrationZkVersion=491092, controllerZkEpoch=-1, controllerZkVersion=-2} to ZkMigrationLeadershipState{kraftControllerId=3000, kraftControllerEpoch=2, kraftMetadataOffset=534009, kraftMetadataEpoch=2142, lastUpdatedTimeMs=1721037470273, migrationZkVersion=491092, controllerZkEpoch=21, controllerZkVersion=21} (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
...
[2024-07-15 12:44:06,276] INFO [KRaftMigrationDriver id=3000] 3000 transitioning from BECOME_CONTROLLER to SYNC_KRAFT_TO_ZK state (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
{code}

Transitioning directly from BECOME_CONTROLLER to SYNC_KRAFT_TO_ZK only happens when KRaftMigrationDriver sees that the migration is already complete. Can you try reverting the migration and attempting it again (see the sketch after this list)?

* Stop KRaft controllers
* Delete the /controller ZNode
* Delete the /migration ZNode
* Reformat the KRaft controllers
* Start KRaft controllers
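The ZNode deletes are normally done with zookeeper-shell.sh and the controller reformat with kafka-storage.sh format. For reference, below is a minimal sketch of the ZK cleanup using the plain ZooKeeper Java client: it dumps the /migration ZNode (to confirm it still claims a completed migration) and then deletes /controller and /migration. The connect string zk-1:2181 and the class name ZkMigrationCleanup are placeholders; this is illustrative only, not an official procedure, and should only be run while the KRaft controllers are stopped.

{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

import java.nio.charset.StandardCharsets;

public class ZkMigrationCleanup {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; point this at your ZooKeeper ensemble.
        String connect = args.length > 0 ? args[0] : "zk-1:2181";
        ZooKeeper zk = new ZooKeeper(connect, 30_000, event -> { });
        try {
            // Show what the KRaft controllers will read back on startup.
            printZNode(zk, "/migration");
            // Remove the stale controller and migration state (version -1 = any version).
            deleteIfExists(zk, "/controller");
            deleteIfExists(zk, "/migration");
        } finally {
            zk.close();
        }
    }

    private static void printZNode(ZooKeeper zk, String path) throws KeeperException, InterruptedException {
        Stat stat = zk.exists(path, false);
        if (stat == null) {
            System.out.println(path + " does not exist");
            return;
        }
        byte[] data = zk.getData(path, false, stat);
        System.out.println(path + " (zkVersion=" + stat.getVersion() + "): "
            + new String(data, StandardCharsets.UTF_8));
    }

    private static void deleteIfExists(ZooKeeper zk, String path) throws KeeperException, InterruptedException {
        if (zk.exists(path, false) != null) {
            zk.delete(path, -1);
            System.out.println("Deleted " + path);
        }
    }
}
{code}

Once the ZNodes are gone, reformatting each controller typically means wiping its metadata log directory and re-running kafka-storage.sh format with the same cluster ID as the ZK cluster, after which the migration should start from a clean state.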
> ZK to KRAFT migration stuck in pre-migration mode
> -------------------------------------------------
>
>                 Key: KAFKA-17146
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17146
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, kraft, migration
>    Affects Versions: 3.7.0, 3.7.1
>         Environment: Isolated virtual machines: 3 VMs with Kafka brokers + 3 VMs with ZooKeeper/KRaft controllers
>            Reporter: Simone Brundu
>            Priority: Blocker
>              Labels: kraft, migration, zookeeper
>
> I'm performing a migration from ZooKeeper to KRaft on a Kafka 3.7.1 cluster.
> (EDIT: the same issue happens on version 3.7.0)
>
> I'm using this configuration to enable SSL everywhere, with SCRAM authentication for brokers and PLAIN authentication for controllers:
> {code:java}
> listener.security.protocol.map=EXTERNAL_SASL:SASL_SSL,CONTROLLER:SASL_SSL
> inter.broker.listener.name=EXTERNAL_SASL
> sasl.enabled.mechanisms=SCRAM-SHA-512,PLAIN
> sasl.mechanism=SCRAM-SHA-512
> sasl.mechanism.controller.protocol=PLAIN
> sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512 {code}
> The cluster initially has 3 brokers and 3 ZooKeeper nodes; a separate cluster of 3 KRaft controllers is then configured and run in parallel, as per the documentation for the migration process.
> I started the migration with the 3 controllers enrolled over SASL_SSL with PLAIN authentication, and I already see a strange TRACE log:
> {code:java}
> TRACE [KRaftMigrationDriver id=3000] Received metadata delta, but the controller is not in dual-write mode. Ignoring the change to be replicated to Zookeeper (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> Later followed by this message, where KRaft is waiting for brokers to register:
> {code:java}
> INFO [KRaftMigrationDriver id=1000] No brokers are known to KRaft, waiting for brokers to register. (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> As soon as I start reconfiguring the brokers so that they connect to the new controllers, everything looks good on the KRaft controllers, with notifications that the brokers are correctly connecting and being enrolled:
> {code:java}
> INFO [QuorumController id=1000] Replayed initial RegisterBrokerRecord for broker 1: RegisterBrokerRecord(brokerId=1, isMigratingZkBroker=true, incarnationId=xxxxxx, brokerEpoch=2638, endPoints=[BrokerEndpoint(name='EXTERNAL_SASL', host='vmk-tdtkafka-01', port=9095, securityProtocol=3)], features=[BrokerFeature(name='metadata.version', minSupportedVersion=19, maxSupportedVersion=19)], rack='zur1', fenced=true, inControlledShutdown=false, logDirs=[xxxxxx]) (org.apache.kafka.controller.ClusterControlManager)
> [...]
> INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2, 3] to register with KRaft. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
> [...]
> INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2] to register with KRaft. (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> As soon as the first broker connects, we start getting these INFO logs related to the migration process on the controller:
> {code:java}
> INFO [QuorumController id=1000] Cannot run write operation maybeFenceReplicas in pre-migration mode. Returning NOT_CONTROLLER. (org.apache.kafka.controller.QuorumController)
> INFO [QuorumController id=1000] maybeFenceReplicas: event failed with NotControllerException in 355 microseconds. Exception message: The controller is in pre-migration mode. (org.apache.kafka.controller.QuorumController){code}
> as well as requests to auto-create topics that already exist, in a loop every 30 seconds, on the last restarted broker:
> {code:java}
> INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager)
> INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager)
> INFO Sent auto-creation request for Set(_schemas) to the active controller.
> (kafka.server.DefaultAutoTopicCreationManager) {code}
> As long as the active controller is still in the old cluster (on one of the Kafka brokers), everything runs fine. As soon as the last broker is restarted, things go off the rails. This last node never gets any partitions assigned and the cluster stays forever with under-replicated partitions. Below is the log from the last broker registration, which should start the migration, but the cluster stays forever in *SYNC_KRAFT_TO_ZK* state in *pre-migration* mode.
> {code:java}
> INFO [QuorumController id=1000] The request from broker 2 to unfence has been granted because it has caught up with the offset of its register broker record 4101
> [...]
> INFO [KRaftMigrationDriver id=1000] Ignoring image MetadataProvenance(lastContainedOffset=4127, lastContainedEpoch=5, lastContainedLogTimeMs=1721133091831) which does not contain a superset of the metadata in ZK. Staying in SYNC_KRAFT_TO_ZK until a newer image is loaded (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> The only way to recover the cluster is to revert everything: stop the clusters, remove /controller from ZooKeeper, and restore the ZooKeeper-only configuration on the brokers. A cleanup of the controllers is necessary too.
> The migration never starts and the controllers never understand that they have to migrate the data from ZooKeeper. On top of that, the new controller claims to be the CONTROLLER but refuses to act as one.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)