[ https://issues.apache.org/jira/browse/KAFKA-17146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Simone Brundu updated KAFKA-17146:
----------------------------------
Description: 
Hello, I'm performing a migration from ZooKeeper to KRaft on a Kafka 3.7.1 cluster. I'm using this configuration to enable SSL everywhere, with SCRAM authentication for the brokers and PLAIN authentication for the controllers:
{code:java}
listener.security.protocol.map=EXTERNAL_SASL:SASL_SSL,CONTROLLER:SASL_SSL
inter.broker.listener.name=EXTERNAL_SASL
sasl.enabled.mechanisms=SCRAM-SHA-512,PLAIN
sasl.mechanism=SCRAM-SHA-512
sasl.mechanism.controller.protocol=PLAIN
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-512
{code}
The cluster initially has 3 brokers and 3 ZooKeeper nodes; a quorum of 3 KRaft controllers is then configured and running in parallel, as described in the documentation for the migration process.
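The migration-specific settings are not listed above; as a rough sketch only (node IDs, hostnames and ports below are placeholders following the 3.7 migration documentation, not the exact files from this environment), the relevant pieces look like this:
{code:java}
# Sketch only, not the actual configuration files from this cluster.
# On each new KRaft controller (ids 1000/2000/3000 assumed from the log output):
process.roles=controller
node.id=1000
controller.quorum.voters=1000@controller-1:9093,2000@controller-2:9093,3000@controller-3:9093
controller.listener.names=CONTROLLER
listeners=CONTROLLER://:9093
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk-1:2181,zk-2:2181,zk-3:2181

# On each ZK-mode broker being migrated:
broker.id=1
inter.broker.protocol.version=3.7
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk-1:2181,zk-2:2181,zk-3:2181
controller.quorum.voters=1000@controller-1:9093,2000@controller-2:9093,3000@controller-3:9093
controller.listener.names=CONTROLLER
{code}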
I started the migration with the 3 controllers enrolled over SASL_SSL with PLAIN authentication, and I immediately see a strange TRACE log:
{code:java}
TRACE [KRaftMigrationDriver id=3000] Received metadata delta, but the controller is not in dual-write mode. Ignoring the change to be replicated to Zookeeper (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
{code}
followed later by this message, where KRaft is waiting for the brokers to connect:
{code:java}
INFO [KRaftMigrationDriver id=1000] No brokers are known to KRaft, waiting for brokers to register. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
{code}
As soon as I start reconfiguring the brokers to connect to the new controllers, everything looks good on the KRaft controllers, which report that the brokers are registering correctly:
{code:java}
INFO [QuorumController id=1000] Replayed initial RegisterBrokerRecord for broker 1: RegisterBrokerRecord(brokerId=1, isMigratingZkBroker=true, incarnationId=xxxxxx, brokerEpoch=2638, endPoints=[BrokerEndpoint(name='EXTERNAL_SASL', host='vmk-tdtkafka-01', port=9095, securityProtocol=3)], features=[BrokerFeature(name='metadata.version', minSupportedVersion=19, maxSupportedVersion=19)], rack='zur1', fenced=true, inControlledShutdown=false, logDirs=[xxxxxx]) (org.apache.kafka.controller.ClusterControlManager)
[...]
INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2, 3] to register with KRaft. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
[...]
INFO [KRaftMigrationDriver id=1000] Still waiting for ZK brokers [2] to register with KRaft. (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
{code}
As soon as the first broker is connected, we start to see these INFO logs related to the migration process on the controller:
{code:java}
INFO [QuorumController id=1000] Cannot run write operation maybeFenceReplicas in pre-migration mode. Returning NOT_CONTROLLER. (org.apache.kafka.controller.QuorumController)
INFO [QuorumController id=1000] maybeFenceReplicas: event failed with NotControllerException in 355 microseconds. Exception message: The controller is in pre-migration mode. (org.apache.kafka.controller.QuorumController)
{code}
as well as requests to auto-create topics that already exist, looping every 30 seconds on the last restarted broker:
{code:java}
INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager)
INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager)
INFO Sent auto-creation request for Set(_schemas) to the active controller. (kafka.server.DefaultAutoTopicCreationManager)
{code}
As long as a controller is still running in the old cluster (on the Kafka brokers), everything runs fine. As soon as the last node is restarted, things go off the rails: this last node never gets any partitions assigned, and the cluster stays forever with under-replicated partitions. This is the log from the registration of the last node, which should start the migration, but the cluster stays forever in the *SYNC_KRAFT_TO_ZK* state in *pre-migration* mode:
{code:java}
INFO [QuorumController id=1000] The request from broker 2 to unfence has been granted because it has caught up with the offset of its register broker record 4101
[...]
INFO [KRaftMigrationDriver id=1000] Ignoring image MetadataProvenance(lastContainedOffset=4127, lastContainedEpoch=5, lastContainedLogTimeMs=1721133091831) which does not contain a superset of the metadata in ZK. Staying in SYNC_KRAFT_TO_ZK until a newer image is loaded (org.apache.kafka.metadata.migration.KRaftMigrationDriver)
{code}
The only way to recover the cluster is to revert everything: stop the clusters, remove /controller from ZooKeeper and restore the ZooKeeper-only configuration on the brokers. A cleanup of the controllers is necessary too. The migration never starts, and the controllers never realize that they have to migrate the data from ZooKeeper. On top of that, the new controller claims to be the CONTROLLER but refuses to act as one.
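For reference, the commands involved in inspecting the state and in the rollback look roughly like this (hostnames and ports are placeholders, and TLS/SASL client options for zookeeper-shell are omitted; this is only a sketch, not the exact procedure used here):
{code:bash}
# Placeholders only, not the exact hosts from this environment.
# Inspect the migration state znode written by the KRaft controller:
bin/zookeeper-shell.sh zk-1:2181 get /migration

# Rollback step: remove the controller znode so that, once the KRaft
# controllers are stopped, a ZK-mode broker can be elected controller again:
bin/zookeeper-shell.sh zk-1:2181 delete /controller

# Check the KRaft quorum state from a broker:
bin/kafka-metadata-quorum.sh --bootstrap-server broker-1:9092 describe --status
{code}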
> ZK to KRAFT migration stuck in pre-migration mode
> -------------------------------------------------
>
>                 Key: KAFKA-17146
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17146
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller, kraft, migration
>    Affects Versions: 3.7.1
>         Environment: Virtual machines isolated: 3 VMs with Kafka brokers + 3 Zookeeper/KRAFT
>            Reporter: Simone Brundu
>            Priority: Blocker
>              Labels: kraft, migration, zookeeper
--
This message was sent by Atlassian Jira
(v8.20.10#820010)