[
https://issues.apache.org/jira/browse/RATIS-2300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17953063#comment-17953063
]
Ivan Andika commented on RATIS-2300:
------------------------------------
Alternatively, the cn0 machine itself might be stuck (for example, due to
memory pressure causing long GC pauses). Recently we had an issue where a large
user workload caused memory problems that were accompanied by HEARTBEAT timeout
logs.
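
One way to test the stalled-JVM hypothesis on cn0 is to look for long pauses directly. The following is only a minimal sketch: PauseProbe is a hypothetical helper, not a Ratis class, and the threshold is an assumed value that would need tuning. It measures how far a short sleep overshoots its requested duration and reads cumulative GC time from the standard JVM management beans.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Ratis) for spotting long JVM stalls. */
final class PauseProbe {
    private PauseProbe() {}

    /**
     * Sleeps for sleepMillis and returns how many extra milliseconds passed.
     * A large overshoot suggests the whole JVM was stalled (e.g. by GC).
     */
    static long measureOvershootMillis(long sleepMillis) {
        long start = System.nanoTime();
        try {
            Thread.sleep(sleepMillis);
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller and report what we measured.
            Thread.currentThread().interrupt();
        }
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return Math.max(0, elapsedMillis - sleepMillis);
    }

    /** True if the overshoot exceeds the (assumed, tunable) threshold. */
    static boolean isSuspicious(long overshootMillis, long thresholdMillis) {
        return overshootMillis > thresholdMillis;
    }

    /** Sums the accumulated collection time reported by all GC beans. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionTime() returns -1 when the value is undefined.
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }
}
```

If the overshoot or the GC time grows sharply around the timestamps of the HEARTBEAT timeouts, that would point to a stalled process on cn0 rather than a problem in the RPC path.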
> New Leader Repeatedly Times Out When Sending RPCs to Old Leader, Causing
> System Hang
> ------------------------------------------------------------------------------------
>
> Key: RATIS-2300
> URL: https://issues.apache.org/jira/browse/RATIS-2300
> Project: Ratis
> Issue Type: Bug
> Reporter: Xinhao Gu
> Assignee: Xinhao Gu
> Priority: Major
> Attachments: image-2025-05-20-12-06-57-540.png,
> image-2025-05-20-12-11-08-832.png, image-2025-05-20-12-18-05-012.png,
> image-2025-05-20-12-18-50-729.png
>
>
> h2. *Problem Description*
> I've encountered an issue where, after a leader election, the newly elected
> leader consistently fails to communicate with the previous (old) leader.
> Specifically, RPCs (such as heartbeats or appendEntries requests) sent from
> the new leader to the old leader always time out.
> h2. *Scene Description*
> *Initial State:* Node {{cn0}} is the established leader of the Raft group.
> *Follower Timeout & Election:* At {{01:09:02,862}}, node {{cn1}} (a
> follower) experiences an election timeout, presumably because it did not
> receive timely heartbeats from the leader {{cn0}}.
> !image-2025-05-20-12-18-05-012.png|width=1699,height=78!
> *Candidacy:* {{cn1}} transitions to the candidate state and starts a new
> leader election for a new term.
> *New Leader Elected:* Node {{cn2}} votes for {{cn1}}. Subsequently,
> {{cn1}} successfully gathers enough votes and becomes the new leader of the
> Raft group.
> !image-2025-05-20-12-18-50-729.png|width=1895,height=488!
> *Communication Failure with Old Leader:* Following this, the new leader
> ({{cn1}}) begins its attempts to manage the group. However, when {{cn1}}
> tries to send RPCs (e.g., heartbeats/AppendEntries) to the _previous_ leader
> ({{cn0}}),
> these attempts consistently time out.
> !image-2025-05-20-12-11-08-832.png|width=1803,height=201!
> !image-2025-05-20-12-06-57-540.png|width=2015,height=689!
>
> h3. Follower's log
> During this period, the follower did not output any logs.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)