[ 
https://issues.apache.org/jira/browse/RATIS-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17953063#comment-17953063
 ] 

Ivan Andika commented on RATIS-2300:
------------------------------------

Alternatively, the cn0 machine itself might be stuck (e.g. due to a memory 
issue causing GC pauses). We recently had an incident where a large user 
workload caused memory issues that were accompanied by HEARTBEAT timeout logs.
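
One way to check the stalled-JVM hypothesis is to measure how far a short 
sleep overshoots its requested duration on cn0. The following is only a 
minimal sketch: PauseProbe, its method names, and the 500 ms / 1000 ms values 
are hypothetical and not part of Ratis, and the threshold would need to be 
compared against the RPC timeout actually configured for the group.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not a Ratis class): measures how long the JVM stalls beyond a requested sleep. */
class PauseProbe {
  /** Sleeps for sleepMs and returns the extra milliseconds that elapsed beyond it (never negative). */
  static long measureExtraPauseMs(long sleepMs) {
    final long start = System.nanoTime();
    try {
      Thread.sleep(sleepMs);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status for the caller
    }
    final long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    return Math.max(0, elapsedMs - sleepMs);
  }

  /** True if the observed extra pause is longer than the given threshold. */
  static boolean exceeds(long extraMs, long thresholdMs) {
    return extraMs > thresholdMs;
  }

  public static void main(String[] args) {
    final long sleepMs = 500;      // assumed probe interval
    final long thresholdMs = 1000; // assumed threshold; compare with the configured RPC timeout
    for (int i = 0; i < 20; i++) {
      final long extraMs = measureExtraPauseMs(sleepMs);
      if (exceeds(extraMs, thresholdMs)) {
        System.out.println("Possible JVM or host pause: about " + extraMs + " ms beyond the sleep");
      }
    }
  }
}
```

If such a probe reports pauses on cn0 around 01:09:02, that would be 
consistent with a GC or host stall rather than a Ratis bug. The GC logs of 
cn0 for the same period would be the more direct evidence.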

> New Leader Repeatedly Times Out When Sending RPCs to Old Leader, Causing 
> System Hang
> ------------------------------------------------------------------------------------
>
>                 Key: RATIS-2300
>                 URL: https://issues.apache.org/jira/browse/RATIS-2300
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Xinhao Gu
>            Assignee: Xinhao Gu
>            Priority: Major
>         Attachments: image-2025-05-20-12-06-57-540.png, 
> image-2025-05-20-12-11-08-832.png, image-2025-05-20-12-18-05-012.png, 
> image-2025-05-20-12-18-50-729.png
>
>
> h2. *Problem Description*
> I've encountered an issue where, after a leader election, the newly elected 
> leader consistently fails to communicate with the previous (old) leader. 
> Specifically, RPCs (such as heartbeats or appendEntries requests) sent from 
> the new leader to the old leader always time out.
> h2. *Scene Description*
> *Initial State:* Node {{cn0}} is the established leader of the Raft group.
> *Follower Timeout & Election:* At {{01:09:02,862}}, node {{cn1}} (a 
> follower) experiences an election timeout, presumably because it did not 
> receive timely heartbeats from the leader {{cn0}}.
> !image-2025-05-20-12-18-05-012.png|width=1699,height=78!
> *Candidacy:* {{cn1}} transitions to the candidate state and starts a new 
> leader election for a new term.
> *New Leader Elected:* Node {{cn2}} votes for {{cn1}}. Subsequently, 
> {{cn1}} successfully gathers enough votes and becomes the new leader of the 
> Raft group.
> !image-2025-05-20-12-18-50-729.png|width=1895,height=488!
> *Communication Failure with Old Leader:* Following this, the new leader 
> ({{cn1}}) begins its attempts to manage the group. However, when {{cn1}} 
> tries to send RPCs (e.g., heartbeats/AppendEntries) to the _previous_ leader 
> ({{cn0}}), these attempts consistently time out.
> !image-2025-05-20-12-11-08-832.png|width=1803,height=201!
> !image-2025-05-20-12-06-57-540.png|width=2015,height=689!
>  
> h3. Follower's log
> During this period, the follower did not output any logs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
