dennis lucero created KAFKA-19318:
-------------------------------------

             Summary: Zookeeper-Kraft failing migration - RPC got timed out before it could be sent
                 Key: KAFKA-19318
                 URL: https://issues.apache.org/jira/browse/KAFKA-19318
             Project: Kafka
          Issue Type: Bug
          Components: kraft
    Affects Versions: 3.6.1, 3.6.2, 3.7.0
            Reporter: dennis lucero
             Fix For: 3.7.1

Despite several attempts, the migration from a ZooKeeper-based cluster to KRaft fails to complete.

We spawn a new, fully healthy cluster with 3 Kafka nodes connected to 3 ZooKeeper nodes. 3 additional Kafka nodes are provisioned for the new controllers.

This was tested with Kafka 3.6.1, 3.6.2 and 3.7.0. It might be linked to KAFKA-15330.

The controllers start without issue. When the brokers are then configured for the migration, the migration does not start. Once the last broker is restarted, we get the following logs:
{code:java}
[2024-06-03 15:11:48,192] INFO [ReplicaFetcherThread-0-11]: Stopped (kafka.server.ReplicaFetcherThread)
[2024-06-03 15:11:48,193] INFO [ReplicaFetcherThread-0-11]: Shutdown completed (kafka.server.ReplicaFetcherThread)
{code}
After that, we only get the following message, every 30 seconds:
{code:java}
[2024-06-03 15:12:04,163] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:12:34,297] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:13:04,536] INFO [BrokerLifecycleManager id=12 isZkBroker=true] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
{code}
The config on the controller node is the following:
{code:java}
kafka0202e1 ~]$ sudo grep -v '^\s*$\|^\s*\#' /etc/kafka/server.properties | grep -v password | sort
advertised.host.name=kafka0202e1.ahub.sb.eu.ginfra.net
broker.rack=e1
controller.listener.names=CONTROLLER
controller.quorum.voters=2...@kafka0202e1.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e2.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e3.ahub.sb.eu.ginfra.net:9093
default.replication.factor=3
delete.topic.enable=false
group.initial.rebalance.delay.ms=3000
inter.broker.protocol.version=3.7
listeners=CONTROLLER://kafka0202e1.ahub.sb.eu.ginfra.net:9093
listener.security.protocol.map=CONTROLLER:SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
log.dirs=/data/kafka
log.message.format.version=3.6
log.retention.check.interval.ms=300000
log.retention.hours=240
log.segment.bytes=1073741824
min.insync.replicas=2
node.id=20
num.io.threads=8
num.network.threads=3
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
process.roles=controller
security.inter.broker.protocol=SSL
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
ssl.cipher.suites=TLS_AES_256_GCM_SHA384
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.3
ssl.endpoint.identification.algorithm=HTTPS
ssl.keystore.location=/etc/kafka/ssl/keystore.ts
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=/etc/kafka/ssl/truststore.ts
transaction.state.log.min.isr=3
transaction.state.log.replication.factor=3
unclean.leader.election.enable=false
zookeeper.connect=10.135.65.199:2181,10.133.65.199:2181,10.137.64.56:2181,
zookeeper.metadata.migration.enable=true
{code}
The config on the broker node is the following:
{code}
$ sudo grep -v '^\s*$\|^\s*\#' /etc/kafka/server.properties | grep -v password | sort
advertised.host.name=kafka0201e3.ahub.sb.eu.ginfra.net
advertised.listeners=SSL://kafka0201e3.ahub.sb.eu.ginfra.net:9092
broker.id=12
broker.rack=e3
controller.listener.names=CONTROLLER # added once all controllers were started
controller.quorum.voters=2...@kafka0202e1.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e2.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e3.ahub.sb.eu.ginfra.net:9093 # added once all controllers were started
default.replication.factor=3
delete.topic.enable=false
group.initial.rebalance.delay.ms=3000
inter.broker.protocol.version=3.7
listener.security.protocol.map=CONTROLLER:SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
listeners=SSL://kafka0201e3.ahub.sb.eu.ginfra.net:9092
log.dirs=/data/kafka
log.retention.check.interval.ms=300000
log.retention.hours=240
log.segment.bytes=1073741824
min.insync.replicas=2
num.io.threads=8
num.network.threads=3
num.partitions=1
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
security.inter.broker.protocol=SSL
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
socket.send.buffer.bytes=102400
ssl.cipher.suites=TLS_AES_256_GCM_SHA384
ssl.client.auth=required
ssl.enabled.protocols=TLSv1.3
ssl.endpoint.identification.algorithm=HTTPS
ssl.keystore.location=/etc/kafka/ssl/keystore.ts
ssl.keystore.type=JKS
ssl.secure.random.implementation=SHA1PRNG
ssl.truststore.location=/etc/kafka/ssl/truststore.ts
transaction.state.log.min.isr=3
transaction.state.log.replication.factor=3
unclean.leader.election.enable=false
zookeeper.connect=10.133.65.199:2181,10.135.65.199:2181,10.137.64.56:2181,
zookeeper.connection.timeout.ms=6000
zookeeper.metadata.migration.enable=true # added once all controllers were started
{code}
When trying to move to the next step (`Migrating brokers to KRaft`), the broker fails to register with the controller quorum and crashes:
{code}
[2024-06-03 15:33:21,553] INFO [BrokerLifecycleManager id=12] Unable to register the broker because the RPC got timed out before it could be sent. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:33:32,549] ERROR [BrokerLifecycleManager id=12] Shutting down because we were unable to register with the controller quorum. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:33:32,550] INFO [BrokerLifecycleManager id=12] Transitioning from STARTING to SHUTTING_DOWN. (kafka.server.BrokerLifecycleManager)
[2024-06-03 15:33:32,551] INFO [broker-12-to-controller-heartbeat-channel-manager]: Shutting down (kafka.server.NodeToControllerRequestThread)
[2024-06-03 15:33:32,551] INFO [broker-12-to-controller-heartbeat-channel-manager]: Shutdown completed (kafka.server.NodeToControllerRequestThread)
[2024-06-03 15:33:32,551] ERROR [BrokerServer id=12] Received a fatal error while waiting for the controller to acknowledge that we are caught up (kafka.server.BrokerServer)
java.util.concurrent.CancellationException
{code}
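For anyone trying to reproduce this, the broker-side settings annotated above as "added once all controllers were started" can be sanity-checked before restarting a broker. The following is a minimal Python sketch, not part of Kafka: the helper names (`parse_properties`, `migration_problems`) are hypothetical, the list of keys is taken only from the annotations in this report rather than from an authoritative source, and the cross-check against `listener.security.protocol.map` is an assumption about what is worth verifying.

```python
# Keys this report marks as "added once all controllers were started".
# Assumption: this is an illustrative list, not an authoritative one.
MIGRATION_KEYS = (
    "controller.listener.names",
    "controller.quorum.voters",
    "zookeeper.metadata.migration.enable",
)


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and full-line comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        # The report annotates some lines with a trailing " # ..." note.
        # Assumption: that note is commentary, not part of the value.
        props[key.strip()] = value.split(" #", 1)[0].strip()
    return props


def migration_problems(props):
    """Return a list of readable problems; an empty list means all checks passed."""
    problems = [f"missing {key}" for key in MIGRATION_KEYS if key not in props]
    if props.get("zookeeper.metadata.migration.enable", "true").lower() != "true":
        problems.append("zookeeper.metadata.migration.enable is not true")
    # Listener names that have a security protocol mapped to them.
    mapped = {
        entry.split(":", 1)[0].strip()
        for entry in props.get("listener.security.protocol.map", "").split(",")
        if entry.strip()
    }
    for name in props.get("controller.listener.names", "").split(","):
        name = name.strip()
        if name and name not in mapped:
            problems.append(f"{name} not in listener.security.protocol.map")
    return problems


# Excerpt of the broker config from this report (voter IDs are elided in the source).
BROKER_SAMPLE = """\
controller.listener.names=CONTROLLER # added once all controllers were started
controller.quorum.voters=2...@kafka0202e1.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e2.ahub.sb.eu.ginfra.net:9093,2...@kafka0202e3.ahub.sb.eu.ginfra.net:9093 # added once all controllers were started
listener.security.protocol.map=CONTROLLER:SSL,PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL
zookeeper.metadata.migration.enable=true # added once all controllers were started
"""

if __name__ == "__main__":
    found = migration_problems(parse_properties(BROKER_SAMPLE))
    for problem in found:
        print(problem)
    if not found:
        print("no problems found")
```

The sketch only inspects the text of the file; it does not test network reachability of the controller quorum or TLS trust between brokers and controllers, so a clean result does not rule out either.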