[ https://issues.apache.org/jira/browse/KAFKA-17146 ]
Simone Brundu updated KAFKA-17146:
----------------------------------
    Affects Version/s: 3.7.0

> ZK to KRaft migration stuck in pre-migration mode
> -------------------------------------------------
>
>                 Key: KAFKA-17146
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17146
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, kraft, migration
>    Affects Versions: 3.7.0, 3.7.1
>         Environment: Isolated virtual machines: 3 VMs with Kafka brokers + 3 with ZooKeeper/KRaft controllers
>            Reporter: Simone Brundu
>            Priority: Blocker
>              Labels: kraft, migration, zookeeper
>
> Hello, I'm facing a migration from ZooKeeper to KRaft on a Kafka 3.7.1 cluster.
> I'm using the following configuration to enable SSL everywhere, with SCRAM authentication for the brokers and PLAIN authentication for the controllers:
> {code:java}
> listener.security.protocol.map=EXTERNAL_SASL:SASL_SSL,CONTROLLER:SASL_SSL
> inter.broker.listener.name=EXTERNAL_SASL
> sasl.enabled.mechanisms=SCRAM-SHA-512,PLAIN
> sasl.mechanism=SCRAM-SHA-512
> sasl.mechanism.controller.protocol=PLAIN
> sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512 {code}
> The cluster initially has 3 brokers and 3 ZooKeeper nodes; a quorum of 3 KRaft controllers is then configured and run in parallel, as per the documentation for the migration process (a sketch of the controller- and broker-side migration properties follows the logs below).
> I started the migration with the 3 controllers enrolled over SASL_SSL with PLAIN authentication, and I immediately got a strange TRACE log:
> {code:java}
> TRACE [KRaftMigrationDriver id=3000] Received metadata delta, but the controller is not in dual-write mode. Ignoring the change to be replicated to Zookeeper (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> followed later by this message, where KRaft is waiting for the brokers to connect:
> {code:java}
> INFO [KRaftMigrationDriver id=1000] No brokers are known to KRaft, waiting for brokers to register. (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> As soon as I started reconfiguring the brokers to let them connect to the new controllers, everything looked good on the KRaft controllers, which logged that the ZK brokers were registering and enrolling correctly:
> {code:java}
> INFO [QuorumController id=1000] Replayed initial RegisterBrokerRecord for broker 1: RegisterBrokerRecord(brokerId=1, isMigratingZkBroker=true, incarnationId=xxxxxx, brokerEpoch=2638, endPoints=[BrokerEndpoint(name='EXTERNAL_SASL', host='vmk-tdtkafka-01', port=9095, securityProtocol=3)], features=[BrokerFeature(name='metadata.version', minSupportedVersion=19, maxSupportedVersion=19)], rack='zur1', fenced=true, inControlledShutdown=false, logDirs=[xxxxxx]) (org.apache.kafka.controller.ClusterControlManager)
> [...]
> INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2, 3] to register with KRaft. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
> [...]
> INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2] to register with KRaft. (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> As soon as the first broker connected, we started to get these INFO logs about the migration process on the controller:
> {code:java}
> INFO [QuorumController id=1000] Cannot run write operation maybeFenceReplicas in pre-migration mode. Returning NOT_CONTROLLER. (org.apache.kafka.controller.QuorumController)
> INFO [QuorumController id=1000] maybeFenceReplicas: event failed with NotControllerException in 355 microseconds. Exception message: The controller is in pre-migration mode. (org.apache.kafka.controller.QuorumController){code}
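> For context on the setup described at the start of this report: the parallel controller quorum was configured following the documented ZK-to-KRaft migration steps. A minimal sketch of what the controller-side properties would look like under that procedure; node IDs, host names, and the ZooKeeper connect string are placeholders, not values taken from this report:
> {code:java}
> # KRaft controller running alongside the ZK cluster during the migration
> process.roles=controller
> node.id=1000
> controller.quorum.voters=1000@ctrl-1:9093,2000@ctrl-2:9093,3000@ctrl-3:9093
> listeners=CONTROLLER://:9093
> controller.listener.names=CONTROLLER
> listener.security.protocol.map=CONTROLLER:SASL_SSL
> sasl.mechanism.controller.protocol=PLAIN
> # Enables the ZK-to-KRaft migration (KIP-866); the controllers read ZK directly
> zookeeper.metadata.migration.enable=true
> zookeeper.connect=zk-1:2181,zk-2:2181,zk-3:2181 {code}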
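> The broker reconfiguration step mentioned above, likewise, amounts to keeping the existing ZooKeeper settings while pointing each broker at the new quorum. A minimal sketch of the properties added to each ZK broker for the migration phase, again with placeholder hosts and IDs:
> {code:java}
> # Existing ZK-mode settings stay in place for the duration of the migration
> broker.id=1
> zookeeper.connect=zk-1:2181,zk-2:2181,zk-3:2181
> # Added for the migration: reach the KRaft quorum and enable migration mode
> controller.quorum.voters=1000@ctrl-1:9093,2000@ctrl-2:9093,3000@ctrl-3:9093
> controller.listener.names=CONTROLLER
> listener.security.protocol.map=EXTERNAL_SASL:SASL_SSL,CONTROLLER:SASL_SSL
> zookeeper.metadata.migration.enable=true {code}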
> We also saw, on the last restarted broker, requests to auto-create topics that already exist, looping every 30 seconds:
> {code:java}
> INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager)
> INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager)
> INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager) {code}
> As long as one of the old Kafka brokers still holds the controller role, everything runs fine. As soon as the last broker is restarted, things go off the rails: that node never gets any partitions assigned, and the cluster stays forever with under-replicated partitions. Below is the log from the registration of the last node, which should trigger the migration, but the cluster remains forever in the *SYNC_KRAFT_TO_ZK* state in *pre-migration* mode:
> {code:java}
> INFO [QuorumController id=1000] The request from broker 2 to unfence has been granted because it has caught up with the offset of its register broker record 4101
> [...]
> INFO [KRaftMigrationDriver id=1000] Ignoring image MetadataProvenance(lastContainedOffset=4127, lastContainedEpoch=5, lastContainedLogTimeMs=1721133091831) which does not contain a superset of the metadata in ZK. Staying in SYNC_KRAFT_TO_ZK until a newer image is loaded (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
> The only way to recover the cluster is to revert everything: stop both clusters, remove /controller from ZooKeeper, and restore the ZooKeeper-only configuration on the brokers. A cleanup of the controllers' state is necessary too.
> The migration never starts, and the controllers never understand that they have to migrate the data from ZooKeeper. Worse, the new controller claims to be the active controller while refusing to act as one, answering write operations with NOT_CONTROLLER.
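> For completeness, the revert described above can be sketched with standard ZooKeeper tooling. This is an illustration rather than the exact commands used; the host name is a placeholder, and /migration is the migration-state znode introduced by KIP-866:
> {code:java}
> # Inspect the migration state recorded in ZooKeeper (KIP-866 /migration znode)
> bin/zookeeper-shell.sh zk-1:2181 get /migration
> # With brokers and controllers stopped, delete the controller znode so a
> # ZK-mode broker can win the controller election again on restart
> bin/zookeeper-shell.sh zk-1:2181 delete /controller {code}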