[ https://issues.apache.org/jira/browse/KAFKA-19905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18040612#comment-18040612 ]

Federico Valeri commented on KAFKA-19905:
-----------------------------------------

The reconnection loop goes on for exactly 5 minutes, which is the shutdown 
timeout hard-coded in the KafkaBroker trait.

This is what I have from the logs of another test run, for one of the brokers:

- SIGTERM received: 14:39:46,282
- Actual shutdown completed: 14:44:46,385
- Time elapsed: 5 minutes and 0.103 seconds (approximately 5 minutes)
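
For reference, the elapsed time can be checked directly from the two timestamps 
above. This is just a quick sanity-check sketch using java.time; the 
ElapsedCheck object name is only for illustration:

{code}
import java.time.{Duration, LocalTime}
import java.time.format.DateTimeFormatter

object ElapsedCheck extends App {
  // Timestamps taken from the broker log above (HH:mm:ss,SSS format).
  val fmt = DateTimeFormatter.ofPattern("HH:mm:ss,SSS")
  val sigterm  = LocalTime.parse("14:39:46,282", fmt)
  val shutdown = LocalTime.parse("14:44:46,385", fmt)

  // Prints PT5M0.103S, i.e. 5 minutes and 0.103 seconds.
  println(Duration.between(sigterm, shutdown))
}
{code}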


> Tight reconnection loop during shutdown
> ---------------------------------------
>
>                 Key: KAFKA-19905
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19905
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 4.1.1
>            Reporter: Federico Valeri
>            Assignee: Federico Valeri
>            Priority: Major
>         Attachments: Screenshot From 2025-11-25 14-40-25.png, test.zip
>
>
> During shutdown, nodes 1 and 2 (brokers) are stuck in an infinite loop trying 
> to connect to node 0 (the controller) every 50ms. The issue is timing 
> sensitive, but it can be reproduced easily by shutting down all the nodes at 
> the same time.
> The problem is that even during shutdown, the NodeToControllerRequestThread 
> continues to run. The RaftControllerNodeProvider still returns node 0 as the 
> controller from cached Raft metadata, but node 0 has already terminated 
> (NodeToControllerChannelManager:323).
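> To make the retry pattern concrete, here is a minimal, self-contained sketch; 
> the class and method names below are simplified stand-ins, not the actual 
> Kafka classes. The provider keeps handing back the same controller node from 
> cached metadata, so the request thread retries it every 50ms until the thread 
> itself is stopped.
> {code}
> // Simplified illustration only; these are not the real Kafka classes.
> case class Node(id: Int, host: String, port: Int)
> 
> trait ControllerNodeProvider {
>   // In the real code this is backed by cached Raft metadata, so it keeps
>   // returning node 0 even after that process has terminated.
>   def get(): Option[Node]
> }
> 
> class HeartbeatLoop(provider: ControllerNodeProvider, retryBackoffMs: Long = 50) {
>   @volatile private var running = true
> 
>   def run(): Unit = {
>     while (running) {
>       provider.get() match {
>         case Some(node) =>
>           // The connect attempt fails because the process behind `node` is
>           // gone, but nothing clears the cached controller, so after the
>           // backoff the loop retries the exact same node.
>           println(s"Connection to node ${node.id} (${node.host}:${node.port}) could not be established")
>         case None => // no controller known yet; wait and retry
>       }
>       Thread.sleep(retryBackoffMs)
>     }
>   }
> 
>   // Unless this is called during broker shutdown, run() spins forever.
>   def shutdown(): Unit = running = false
> }
> {code}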
> Looking at the logs, the controller shut down at 12:31:38 while the brokers 
> were still in controlled shutdown. The sequence shows:
> 1. Nodes 1 and 2 request controlled shutdown
> 2. Controller grants the shutdown
> 3. Controller itself shuts down (RaftManager shutdown at 12:31:38)
> 4. Nodes 1 and 2 continue trying to send heartbeats to the now-dead controller
> 5. They get stuck in this reconnection loop because the 
> NodeToControllerRequestThread is still running and has not been shut down 
> properly (see the sketch after the log excerpt below)
> {code}
> [2025-11-21 12:31:38,515] INFO [NodeToControllerChannelManager id=2 
> name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,515] WARN [NodeToControllerChannelManager id=2 
> name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
> established. Node may not be available. 
> (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,515] INFO 
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
> controller, from now on will use node localhost:9090 (id: 0 rack: null 
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,566] INFO 
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
> controller, from now on will use node localhost:9090 (id: 0 rack: null 
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,566] INFO [NodeToControllerChannelManager id=2 
> name=heartbeat] Node 0 disconnected. (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,567] WARN [NodeToControllerChannelManager id=2 
> name=heartbeat] Connection to node 0 (localhost/127.0.0.1:9090) could not be 
> established. Node may not be available. 
> (org.apache.kafka.clients.NetworkClient)
> [2025-11-21 12:31:38,567] INFO 
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
> controller, from now on will use node localhost:9090 (id: 0 rack: null 
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> [2025-11-21 12:31:38,616] INFO 
> [broker-2-to-controller-heartbeat-channel-manager]: Recorded new KRaft 
> controller, from now on will use node localhost:9090 (id: 0 rack: null 
> isFenced: false) (kafka.server.NodeToControllerRequestThread)
> {code}
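> One possible direction, sketched with the simplified names from the snippet 
> above (this is an illustration of the ordering problem described in point 5, 
> not the actual Kafka shutdown path): stop the heartbeat request thread 
> explicitly as part of broker shutdown, instead of letting it spin until the 
> hard-coded 5 minute timeout expires.
> {code}
> // Sketch only, reusing the simplified HeartbeatLoop from the previous snippet.
> class BrokerShutdown(heartbeat: HeartbeatLoop) {
>   def shutdown(): Unit = {
>     // 1. Request controlled shutdown from the controller (omitted here).
>     // 2. Whether or not the controller answers, the heartbeat loop must be
>     //    stopped; otherwise it keeps reconnecting to the cached controller
>     //    node until the broker shutdown timeout expires.
>     heartbeat.shutdown()
>   }
> }
> {code}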



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
