Federico Valeri created KAFKA-19905:
---------------------------------------

             Summary: Tight reconnection loop during shutdown
                 Key: KAFKA-19905
                 URL: https://issues.apache.org/jira/browse/KAFKA-19905
             Project: Kafka
          Issue Type: Bug
          Components: core
    Affects Versions: 4.1.1
            Reporter: Federico Valeri
         Attachments: test.zip

During clean shutdown, nodes 1 and 2 (brokers) are stuck in an infinite loop 
trying to connect to node 0 (the controller) every 50ms. The issue is timing 
sensitive, but it can be reproduced easily by shutting down all nodes at the 
same time.

The problem is that even during shutdown, the NodeToControllerRequestThread 
continues to run. The RaftControllerNodeProvider still returns node 0 as the 
controller from cached Raft metadata, but node 0 has already terminated 
(NodeToControllerChannelManager:323).

Looking at logs, the controller shut down at 12:31:38 while brokers were still 
in controlled shutdown. The sequence shows:

1. Nodes 1 and 2 request controlled shutdown
2. Controller grants the shutdown
3. Controller itself shuts down (RaftManager shutdown at 12:31:38)
4. Nodes 1 and 2 continue trying to heartbeat to the now-dead controller
5. They get stuck in this reconnection loop because the 
NodeToControllerRequestThread is still running and hasn't been shut down 
properly

{code}
[2025-11-21 12:31:38,515] INFO [NodeToControllerChannelManager id=2 
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] WARN [NodeToControllerChannelManager id=2 
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO [NodeToControllerChannelManager id=2 
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] WARN [NodeToControllerChannelManager id=2 
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,616] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
{code}

There are two solutions:

1. Make the NodeToControllerRequestThread interruptible 
(InterBrokerSendThread.isInterruptible=true)
2. Check isRunning() in the doWork() method before attempting controller 
operations

I went with solution 2 to avoid interrupting in-flight requests, which could 
cause other issues.
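
As a rough illustration of solution 2, here is a minimal standalone sketch of 
the guard. It is not the actual patch: ControllerRequestLoop and its 
connectAttempts counter are hypothetical stand-ins, and the "connect" step is 
reduced to incrementing that counter. Only the method names isRunning() and 
doWork() come from the description above; initiateShutdown() is assumed here 
as the call that flips the running flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified model of a controller request loop.
// This is NOT Kafka's NodeToControllerRequestThread.
class ControllerRequestLoop {
    private final AtomicBoolean running = new AtomicBoolean(true);
    private final AtomicInteger connectAttempts = new AtomicInteger();

    boolean isRunning() {
        return running.get();
    }

    // Assumed shutdown hook: after this call, doWork() must not
    // look up the controller or try to reconnect.
    void initiateShutdown() {
        running.set(false);
    }

    int connectAttempts() {
        return connectAttempts.get();
    }

    // One loop iteration. Returns true if a connection was attempted.
    boolean doWork() {
        if (!isRunning()) {
            // Shutdown in progress: skip the cached controller lookup,
            // so a dead controller is never retried.
            return false;
        }
        // Stand-in for "resolve controller from cached metadata and connect".
        connectAttempts.incrementAndGet();
        return true;
    }
}
```

With this guard, iterations that run after shutdown has started return 
without touching the network, while any request already in flight is left 
alone because the thread is never interrupted.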





