[ https://issues.apache.org/jira/browse/KAFKA-17146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Simone Brundu updated KAFKA-17146:
----------------------------------
Description: 
Hello, I'm performing a migration from ZooKeeper to KRaft on a Kafka 3.7.1 cluster. I'm using this configuration to enable SSL everywhere, with SCRAM authentication for the brokers and PLAIN authentication for the controllers:
{code:java}
listener.security.protocol.map=EXTERNAL_SASL:SASL_SSL,CONTROLLER:SASL_SSL
inter.broker.listener.name=EXTERNAL_SASL
sasl.enabled.mechanisms=SCRAM-SHA-512,PLAIN
sasl.mechanism=SCRAM-SHA-512
sasl.mechanism.controller.protocol=PLAIN
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
{code}
The cluster initially has 3 brokers and 3 ZooKeeper nodes; a quorum of 3 KRaft controllers is then configured and running in parallel, as described in the documentation for the migration process.
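The migration-specific settings are not listed above; as a rough sketch only (node IDs, hostnames and ports below are placeholders following the 3.7 migration documentation, not the exact files from this environment), the relevant pieces look like this:
{code:java}
# Sketch only, not the actual configuration files from this cluster.
# On each new KRaft controller (ids 1000/2000/3000 assumed from the log output):
process.roles=controller
node.id=1000
controller.quorum.voters=1000@controller-1:9093,2000@controller-2:9093,3000@controller-3:9093
controller.listener.names=CONTROLLER
listeners=CONTROLLER://:9093
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk-1:2181,zk-2:2181,zk-3:2181

# On each ZK-mode broker being migrated:
broker.id=1
inter.broker.protocol.version=3.7
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk-1:2181,zk-2:2181,zk-3:2181
controller.quorum.voters=1000@controller-1:9093,2000@controller-2:9093,3000@controller-3:9093
controller.listener.names=CONTROLLER
{code}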
I started the migration with the 3 controllers enrolled over SASL_SSL with PLAIN authentication, and I immediately see a strange TRACE log:
{code:java}
TRACE [KRaftMigrationDriver id=3000] Received metadata delta, but the controller is not in dual-write mode. Ignoring the change to be replicated to Zookeeper (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
{code}
followed later by this message, where KRaft is waiting for the brokers to connect:
{code:java}
INFO [KRaftMigrationDriver id=1000] No brokers are known to KRaft, waiting for brokers to register. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
{code}
As soon as I start reconfiguring the brokers to connect to the new controllers, everything looks good on the KRaft controllers, which report that the brokers are registering correctly:
{code:java}
INFO [QuorumController id=1000] Replayed initial RegisterBrokerRecord for broker 1: RegisterBrokerRecord(brokerId=1, isMigratingZkBroker=true, incarnationId=xxxxxx, brokerEpoch=2638, endPoints=[BrokerEndpoint(name='EXTERNAL_SASL', host='vmk-tdtkafka-01', port=9095, securityProtocol=3)], features=[BrokerFeature(name='metadata.version', minSupportedVersion=19, maxSupportedVersion=19)], rack='zur1', fenced=true, inControlledShutdown=false, logDirs=[xxxxxx]) (org.apache.kafka.controller.ClusterControlManager)
[...]
INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2, 3] to register with KRaft. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
[...]
INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2] to register with KRaft. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
{code}
As soon as the first broker is connected, we start to see these INFO logs related to the migration process on the controller:
{code:java}
INFO [QuorumController id=1000] Cannot run write operation maybeFenceReplicas in pre-migration mode. Returning NOT_CONTROLLER. (org.apache.kafka.controller.QuorumController)
INFO [QuorumController id=1000] maybeFenceReplicas: event failed with NotControllerException in 355 microseconds. Exception message: The controller is in pre-migration mode. (org.apache.kafka.controller.QuorumController)
{code}
as well as requests to auto-create topics that already exist, looping every 30 seconds on the last restarted broker:
{code:java}
INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager)
INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager)
INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager)
{code}
As long as a controller is still running in the old cluster (on the Kafka brokers), everything runs fine. As soon as the last node is restarted, things go off the rails: this last node never gets any partitions assigned, and the cluster stays forever with under-replicated partitions. This is the log from the registration of the last node, which should start the migration, but the cluster stays forever in the *SYNC_KRAFT_TO_ZK* state in *pre-migration* mode:
{code:java}
INFO [QuorumController id=1000] The request from broker 2 to unfence has been granted because it has caught up with the offset of its register broker record 4101
[...]
INFO [KRaftMigrationDriver id=1000] Ignoring image MetadataProvenance(lastContainedOffset=4127, lastContainedEpoch=5, lastContainedLogTimeMs=1721133091831) which does not contain a superset of the metadata in ZK. Staying in SYNC_KRAFT_TO_ZK until a newer image is loaded (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
{code}
The only way to recover the cluster is to revert everything: stop the clusters, remove /controller from ZooKeeper and restore the ZooKeeper-only configuration on the brokers. A cleanup of the controllers is necessary too. The migration never starts, and the controllers never realize that they have to migrate the data from ZooKeeper. On top of that, the new controller claims to be the CONTROLLER but refuses to act as one.
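For reference, the commands involved in inspecting the state and in the rollback look roughly like this (hostnames and ports are placeholders, and TLS/SASL client options for zookeeper-shell are omitted; this is only a sketch, not the exact procedure used here):
{code:bash}
# Placeholders only, not the exact hosts from this environment.
# Inspect the migration state znode written by the KRaft controller:
bin/zookeeper-shell.sh zk-1:2181 get /migration

# Rollback step: remove the controller znode so that, once the KRaft
# controllers are stopped, a ZK-mode broker can be elected controller again:
bin/zookeeper-shell.sh zk-1:2181 delete /controller

# Check the KRaft quorum state from a broker:
bin/kafka-metadata-quorum.sh --bootstrap-server broker-1:9092 describe --status
{code}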
> ZK to KRAFT migration stuck in pre-migration mode
> -------------------------------------------------
>
>                 Key: KAFKA-17146
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17146
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, kraft, migration
>    Affects Versions: 3.7.1
>         Environment: Virtual machines isolated: 3 VMs with Kafka brokers + 3 Zookeeper/KRAFT
>            Reporter: Simone Brundu
>            Priority: Blocker
>              Labels: kraft, migration, zookeeper
--
This message was sent by Atlassian Jira
(v8.20.10#820010)