Simone Brundu created KAFKA-17146:
-------------------------------------

             Summary: ZK to KRAFT migration stuck in pre-migration mode
                 Key: KAFKA-17146
                 URL: https://issues.apache.org/jira/browse/KAFKA-17146
             Project: Kafka
          Issue Type: Bug
          Components: controller, kraft, migration
    Affects Versions: 3.7.1
         Environment: Isolated virtual machines: 3 VMs with Kafka brokers + 3 VMs 
with ZooKeeper/KRaft controllers
            Reporter: Simone Brundu


Hello, I'm working through a migration from ZooKeeper to KRaft on a Kafka 3.7.1 cluster.

I'm using this configuration to enable SSL everywhere, with SCRAM authentication 
for the brokers and PLAIN authentication for the controllers:
{code:java}
listener.security.protocol.map=EXTERNAL_SASL:SASL_SSL,CONTROLLER:SASL_SSL
inter.broker.listener.name=EXTERNAL_SASL
sasl.enabled.mechanisms=SCRAM-SHA-512,PLAIN
sasl.mechanism=SCRAM-SHA-512
sasl.mechanism.controller.protocol=PLAIN
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512 {code}
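For completeness, the PLAIN credentials for the CONTROLLER listener on the controllers are defined roughly as follows. This is only a sketch: the listener address, usernames and passwords below are placeholders, not my real values:
{code:java}
# Hypothetical sketch of the CONTROLLER listener SASL/PLAIN setup (placeholder credentials)
listeners=CONTROLLER://0.0.0.0:9093
listener.name.controller.plain.sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="controller-admin" \
  password="controller-secret" \
  user_controller-admin="controller-secret";
{code}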
The cluster initially has 3 brokers and 3 ZooKeeper nodes; a quorum of 3 KRaft 
controllers is then configured and running in parallel, as described in the 
documentation for the migration process.
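The controllers follow the documented migration configuration, roughly along these lines (the node IDs match the logs below, while host names, ports and paths are placeholders):
{code:java}
# Sketch of one KRaft controller's server.properties during migration (placeholder hosts/ports)
process.roles=controller
node.id=3000
controller.quorum.voters=1000@vmk-ctrl-01:9093,2000@vmk-ctrl-02:9093,3000@vmk-ctrl-03:9093
controller.listener.names=CONTROLLER
listeners=CONTROLLER://0.0.0.0:9093

# Enable the migration and point the controllers at the existing ZooKeeper ensemble
zookeeper.metadata.migration.enable=true
zookeeper.connect=vmk-zk-01:2181,vmk-zk-02:2181,vmk-zk-03:2181

# The controllers also need the brokers' inter-broker listener settings to send RPCs to the ZK brokers
inter.broker.listener.name=EXTERNAL_SASL
listener.security.protocol.map=EXTERNAL_SASL:SASL_SSL,CONTROLLER:SASL_SSL
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
{code}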
I started the migration with the 3 controllers enrolled with SASL_SSL and PLAIN 
authentication, and I immediately see a strange TRACE log:
{code:java}
TRACE [KRaftMigrationDriver id=3000] Received metadata delta, but the 
controller is not in dual-write mode. Ignoring the change to be replicated to 
Zookeeper (org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
Later comes this message, where KRaft is waiting for the brokers to register:
{code:java}
INFO [KRaftMigrationDriver id=1000] No brokers are known to KRaft, waiting for 
brokers to register. (org.apache.kafka.metadata.migration.KRaftMigrationDriver) 
{code}
As soon as I start reconfiguring the brokers to connect to the new controllers 
(see the configuration sketch after the log excerpt below), everything looks good 
on the KRaft controllers, which report that the migrating ZK brokers are 
registering and being enrolled correctly:
{code:java}
INFO [QuorumController id=1000] Replayed initial RegisterBrokerRecord for 
broker 1: RegisterBrokerRecord(brokerId=1, isMigratingZkBroker=true, 
incarnationId=xxxxxx, brokerEpoch=2638, 
endPoints=[BrokerEndpoint(name='EXTERNAL_SASL', host='vmk-tdtkafka-01', 
port=9095, securityProtocol=3)], 
features=[BrokerFeature(name='metadata.version', minSupportedVersion=19, 
maxSupportedVersion=19)], rack='zur1', fenced=true, inControlledShutdown=false, 
logDirs=[xxxxxx]) (org.apache.kafka.controller.ClusterControlManager)
[...]
INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2, 3] to 
register with KRaft. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
[...]
INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2] to 
register with KRaft. (org.apache.kafka.metadata.migration.KRaftMigrationDriver) 
{code}
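For reference, the broker-side reconfiguration mentioned above is essentially the change set from the migration documentation, roughly like this (placeholder hosts and ports again):
{code:java}
# Sketch of the migration-related additions to each broker's server.properties (placeholder values)
inter.broker.protocol.version=3.7

# Enable the migration on the ZK brokers as well
zookeeper.metadata.migration.enable=true

# Point the brokers at the KRaft controller quorum
controller.quorum.voters=1000@vmk-ctrl-01:9093,2000@vmk-ctrl-02:9093,3000@vmk-ctrl-03:9093
controller.listener.names=CONTROLLER

# The CONTROLLER listener must also appear in the security protocol map
listener.security.protocol.map=EXTERNAL_SASL:SASL_SSL,CONTROLLER:SASL_SSL
{code}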
As soon as the first broker registers, we start to get these INFO logs:
{code:java}
INFO [QuorumController id=1000] Cannot run write operation maybeFenceReplicas 
in pre-migration mode. Returning NOT_CONTROLLER. 
(org.apache.kafka.controller.QuorumController)
INFO [QuorumController id=1000] maybeFenceReplicas: event failed with 
NotControllerException in 355 microseconds. Exception message: The controller 
is in pre-migration mode. (org.apache.kafka.controller.QuorumController) {code}
As long as one of the old brokers still holds the controller role, everything 
runs fine. As soon as the last broker is restarted, things go off the rails: that 
last broker never gets any partitions assigned and the cluster is left with 
under-replicated partitions indefinitely. Below is the log from the registration 
of this last broker, which should start the migration, but the cluster stays 
forever in the *SYNC_KRAFT_TO_ZK* state in *pre-migration* mode.
{code:java}
INFO [QuorumController id=1000] The request from broker 2 to unfence has been 
granted because it has caught up with the offset of its register broker record 
4101
[...]
INFO [KRaftMigrationDriver id=1000] Ignoring image 
MetadataProvenance(lastContainedOffset=4127, lastContainedEpoch=5, 
lastContainedLogTimeMs=1721133091831) which does not contain a superset of the 
metadata in ZK. Staying in SYNC_KRAFT_TO_ZK until a newer image is loaded 
(org.apache.kafka.metadata.migration.KRaftMigrationDriver) {code}
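At this point, one way to see how far the migration got is to look at the state the controller persists in ZooKeeper; this is only a sketch, assuming a reachable ZooKeeper client port (adjust host and port):
{code:java}
# Inspect the migration state and the current controller claim in ZooKeeper (sketch)
bin/zookeeper-shell.sh vmk-zk-01:2181 get /migration
bin/zookeeper-shell.sh vmk-zk-01:2181 get /controller
{code}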
The only way to recover the cluster is to revert everything: stop the clusters, 
remove /controller from ZooKeeper, and restore the ZooKeeper-only configuration 
on the brokers. A cleanup of the KRaft controllers is necessary too.
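Concretely, the rollback looks roughly like this; host names and paths are placeholders and the steps are only a sketch of what I do:
{code:java}
# Sketch of the rollback (placeholder host names and paths)
# 1. Stop the KRaft controllers and restore the ZooKeeper-only server.properties on the brokers
# 2. Remove the controller claim from ZooKeeper (and the migration state, if present)
bin/zookeeper-shell.sh vmk-zk-01:2181 delete /controller
bin/zookeeper-shell.sh vmk-zk-01:2181 delete /migration
# 3. Wipe the KRaft controllers' metadata log directories before any new attempt
rm -rf /var/lib/kafka/kraft-controller-logs/*
{code}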

The migration never starts and the controllers never realize that they have to 
migrate the data from ZooKeeper. More than that, the new controller claims to be 
the controller but refuses to act as one, answering with NOT_CONTROLLER because 
it stays in pre-migration mode.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
