[ 
https://issues.apache.org/jira/browse/KAFKA-19905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Federico Valeri updated KAFKA-19905:
------------------------------------
    Description: 
During shutdown, nodes 1 and 2 (brokers) are stuck in an infinite loop trying 
to connect to node 0 (the controller) every 50ms. The issue is time sensitive, 
but it can be reproduced easily shutting down all nodes at the same time (i.e. 
pkill -SIGTERM kafka.Kafka when running them on the same machine).

The problem is that even during shutdown, the NodeToControllerRequestThread 
continues to run. The RaftControllerNodeProvider still returns node 0 as the 
controller from cached Raft metadata, but node 0 has already terminated 
(NodeToControllerChannelManager:323).

Looking at logs, the controller shut down at 12:31:38 while brokers were still 
in controlled shutdown. The sequence shows:

1. Node 1 and 2 request controlled shutdown
2. Controller grants the shutdown
3. Controller itself shuts down (RaftManager shutdown at 12:31:38)
4. Node 1 and 2 continue trying to heartbeat to the now-dead controller
5. They get stuck in this reconnection loop because the 
NodeToControllerRequestThread is still running and hasn't been shut down 
properly

{code}
[2025-11-21 12:31:38,515] INFO [NodeToControllerChannelManager id=2 
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] WARN [NodeToControllerChannelManager id=2 
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO [NodeToControllerChannelManager id=2 
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] WARN [NodeToControllerChannelManager id=2 
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,616] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
{code}




  was:
During shutdown, nodes 1 and 2 (brokers) are stuck in an infinite loop trying 
to connect to node 0 (the controller) every 50ms. The issue is time sensitive, 
but it can be reproduced easily shutting down all nodes at the same time.

The problem is that even during shutdown, the NodeToControllerRequestThread 
continues to run. The RaftControllerNodeProvider still returns node 0 as the 
controller from cached Raft metadata, but node 0 has already terminated 
(NodeToControllerChannelManager:323).

Looking at logs, the controller shut down at 12:31:38 while brokers were still 
in controlled shutdown. The sequence shows:

1. Node 1 and 2 request controlled shutdown
2. Controller grants the shutdown
3. Controller itself shuts down (RaftManager shutdown at 12:31:38)
4. Node 1 and 2 continue trying to heartbeat to the now-dead controller
5. They get stuck in this reconnection loop because the 
NodeToControllerRequestThread is still running and hasn't been shut down 
properly

{code}
[2025-11-21 12:31:38,515] INFO [NodeToControllerChannelManager id=2 
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] WARN [NodeToControllerChannelManager id=2 
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,515] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,566] INFO [NodeToControllerChannelManager id=2 
name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] WARN [NodeToControllerChannelManager id=2 
name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
established. Node may not be available. (org.apache.kafka.clients.NetworkClient)
[2025-11-21 12:31:38,567] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
[2025-11-21 12:31:38,616] INFO 
[broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
controller, from now on will use node localhost:9090 (id: 0 rack: null 
isFenced: false) (kafka.server.NodeToControllerRequestThread)
{code}





> Tight reconnection loop during shutdown
> ---------------------------------------
>
>                 Key: KAFKA-19905
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19905
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 4.1.1
>            Reporter: Federico Valeri
>            Assignee: Federico Valeri
>            Priority: Major
>         Attachments: Screenshot From 2025-11-25 14-40-25.png, test.zip
>
>
> During shutdown, nodes 1 and 2 (brokers) are stuck in an infinite loop trying 
> to connect to node 0 (the controller) every 50ms. The issue is time 
> sensitive, but it can be reproduced easily shutting down all nodes at the 
> same time (i.e. pkill -SIGTERM kafka.Kafka when running them on the same 
> machine).
> The problem is that even during shutdown, the NodeToControllerRequestThread 
> continues to run. The RaftControllerNodeProvider still returns node 0 as the 
> controller from cached Raft metadata, but node 0 has already terminated 
> (NodeToControllerChannelManager:323).
> Looking at logs, the controller shut down at 12:31:38 while brokers were 
> still in controlled shutdown. The sequence shows:
> 1. Node 1 and 2 request controlled shutdown
> 2. Controller grants the shutdown
> 3. Controller itself shuts down (RaftManager shutdown at 12:31:38)
> 4. Node 1 and 2 continue trying to heartbeat to the now-dead controller
> 5. They get stuck in this reconnection loop because the 
> NodeToControllerRequestThread is still running and hasn't been shut down 
> properly
> {code}
> [2025-11-21 12:31:38,515] INFO [NodeToControllerChannelManager id=2 
> name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,515] WARN [NodeToControllerChannelManager id=2 
> name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
> established. Node may not be available. 
> (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,515] INFO 
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
> controller, from now on will use node localhost:9090 (id: 0 rack: null 
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,566] INFO 
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
> controller, from now on will use node localhost:9090 (id: 0 rack: null 
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,566] INFO [NodeToControllerChannelManager id=2 
> name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,567] WARN [NodeToControllerChannelManager id=2 
> name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
> established. Node may not be available. 
> (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,567] INFO 
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
> controller, from now on will use node localhost:9090 (id: 0 rack: null 
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,616] INFO 
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
> controller, from now on will use node localhost:9090 (id: 0 rack: null 
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to