Federico Valeri created KAFKA-19905:
---------------------------------------
Summary: Tight reconnection loop during shutdown
Key: KAFKA-19905
URL: https://issues.apache.org/jira/browse/KAFKA-19905
Project: Kafka
Issue Type: Bug
Components: core
Affects Versions: 4.1.1
Reporter: Federico Valeri
Attachments: test.zip
During clean shutdown, nodes 1 and 2 (brokers) are stuck in an infinite loop
trying to connect to node 0 (the controller) every 50ms. The issue is time
sensitive, but it can be reproduced easily shutting down all nodes at the same
time.
The problem is that even during shutdown, the NodeToControllerRequestThread
continues to run. The RaftControllerNodeProvider still returns node 0 as the
controller from cached Raft metadata, but node 0 has already terminated
(NodeToControllerChannelManager:323).
Looking at logs, the controller shut down at 12:31:38 while brokers were still
in controlled shutdown. The sequence shows:
1. Node 1 and 2 request controlled shutdown
2. Controller grants the shutdown
3. Controller itself shuts down (RaftManager shutdown at 12:31:38)
4. Node 1 and 2 continue trying to heartbeat to the now-dead controller
5. They get stuck in this reconnection loop because the
NodeToControllerRequestThread is still running and hasn't been shut down
properly
{code}
[2025-11-21 12:31:38,515] INFO [NodeToControllerChannelManager id=2
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] WARN [NodeToControllerChannelManager id=2
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] INFO
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
controller, from now on will use node localhost:9090 (id: 0 rack: null
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
controller, from now on will use node localhost:9090 (id: 0 rack: null
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO [NodeToControllerChannelManager id=2
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] WARN [NodeToControllerChannelManager id=2
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] INFO
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
controller, from now on will use node localhost:9090 (id: 0 rack: null
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,616] INFO
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft
controller, from now on will use node localhost:9090 (id: 0 rack: null
isFenced: false) (kafka.server.NodeToControllerRequestThread)
{code}
There are two solutions:
1. Make the NodeToControllerRequestThread interruptible
(InterBrokerSendThread.isInterruptible=true)
2. Check isRunning() in the doWork() method before attempting controller
operations
I went with solution 2 to avoid interrupting in-flight requests and could cause
other issues.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)