[
https://issues.apache.org/jira/browse/RATIS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884363#comment-17884363
]
Tsz-wo Sze commented on RATIS-2162:
-----------------------------------
[~tohsakarin__], thanks a lot for digging out the problem! Would you like to
provide a pull request?
> When closing leaderState, if the logAppender thread sends a snapshot, a
> deadlock may occur
> ------------------------------------------------------------------------------------------
>
> Key: RATIS-2162
> URL: https://issues.apache.org/jira/browse/RATIS-2162
> Project: Ratis
> Issue Type: Wish
> Affects Versions: 3.1.0
> Reporter: yuuka
> Priority: Major
> Attachments: image-2024-09-24-10-41-20-140.png,
> image-2024-09-24-10-43-34-812.png
>
>
> This is the cause of the problem reported in RATIS-2161 ("Grpc may spawn
> many threads").
> 1. Old leader S receives a larger term number and converts to follower.
> 2. LogAppender thread L does not receive the shutdown signal in time because
> a restart was triggered abnormally.
> 3. S holds the ‘server’ lock and waits for L to shut down.
> 4. L triggers snapshot sending and calls newSnapshotRequests.
> 5. In newSnapshotRequests, L tries to acquire the ‘server’ lock.
> !image-2024-09-24-10-43-34-812.png!
> !image-2024-09-24-10-41-20-140.png!
> This eventually leads to a deadlock: grpc cannot reclaim the thread in time,
> and the problem described in RATIS-2161 then occurs.
> [Timeline diagram. LeaderState timeline: close LeaderState, stop
> LogAppender L, shutdown LeaderState. LogAppender L timeline: restart
> logAppender, newInstallSnapshotRequests.]
>
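
The lock-then-wait pattern in steps 3 to 5 can be reproduced outside Ratis. The following is a minimal sketch, not Ratis code: the class name, the ReentrantLock standing in for the ‘server’ lock, and the 200 ms bound are all assumptions made for illustration. The real wait is described as unbounded; it is bounded here only so that the sketch terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for steps 3 to 5; none of these names are Ratis APIs. */
final class LockThenJoinSketch {
  /**
   * Returns true if the appender thread is still blocked on the lock
   * when the closing thread gives up waiting for it.
   */
  static boolean appenderStuckWhileCloserWaits() {
    final ReentrantLock serverLock = new ReentrantLock(); // stands in for the 'server' lock
    final CountDownLatch lockHeld = new CountDownLatch(1);

    // Stand-in for LogAppender L: it needs the lock before it can make progress.
    final Thread appender = new Thread(() -> {
      try {
        lockHeld.await();
        serverLock.lockInterruptibly(); // blocks while S holds the lock
        serverLock.unlock();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "appender-L");
    appender.start();

    boolean stuck;
    try {
      serverLock.lock(); // stand-in for S holding the 'server' lock
      try {
        lockHeld.countDown();
        appender.join(200); // bounded only so that the sketch terminates
        stuck = appender.isAlive();
      } finally {
        serverLock.unlock();
      }
      // Clean up: the appender either acquires the now-free lock or is interrupted.
      appender.interrupt();
      appender.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return stuck;
  }

  public static void main(String[] args) {
    System.out.println("appender stuck while lock was held: "
        + appenderStuckWhileCloserWaits());
  }
}
```

With an unbounded join in place of the bounded one, neither thread could make progress, which matches the deadlock described above.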
>
> I think it is possible to check the raft server's role every time
> LogAppender is awakened, and to close the LogAppender if the server is no
> longer the leader.
>
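
One way to picture the proposed check is the sketch below. It is only an illustration under assumptions: the class, the `isLeader` supplier, and `onWakeUp` are hypothetical names rather than Ratis APIs, and the role query is assumed to be readable without taking the ‘server’ lock.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the proposed check; names are not Ratis APIs. */
final class RoleCheckingAppenderSketch {
  private final BooleanSupplier isLeader; // assumed role query that takes no lock
  private volatile boolean closed;
  private int rounds;

  RoleCheckingAppenderSketch(BooleanSupplier isLeader) {
    this.isLeader = isLeader;
  }

  /** Called each time the appender wakes up; returns false once it has closed itself. */
  boolean onWakeUp() {
    if (closed || !isLeader.getAsBoolean()) {
      closed = true; // stop before touching any lock-protected state
      return false;
    }
    rounds++; // stands in for appending entries or sending a snapshot
    return true;
  }

  boolean isClosed() {
    return closed;
  }

  int getRounds() {
    return rounds;
  }

  public static void main(String[] args) {
    final AtomicBoolean leader = new AtomicBoolean(true);
    final RoleCheckingAppenderSketch appender = new RoleCheckingAppenderSketch(leader::get);
    System.out.println(appender.onWakeUp()); // still leader, so work proceeds
    leader.set(false);                       // a larger term was seen
    System.out.println(appender.onWakeUp()); // appender closes itself
  }
}
```

The point of the design is that the role check happens before any attempt to take a lock, so an appender that missed the shutdown signal still exits on its next wake-up.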
>
> In addition, in LeaderStateImpl, there is another concurrency safety issue
> regarding senderList.
> removeSenders, addSenders and stopAll may be called by multiple threads.
> For example, thread t1 creates a futures array of size 3 in stopAll, and then
> thread t2 calls removeSenders. This may cause an out-of-bounds access,
> because futures.length is 3 but senders.size() < 3.
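
A possible way to avoid the out-of-bounds access described for stopAll is to size and index from a single snapshot of the list. The sketch below assumes this approach and is not the Ratis implementation: the class, the String senders, and the `stopAsync` helper are hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical sketch; the String senders and stopAsync are placeholders. */
final class SenderListSketch {
  private final List<String> senders = new CopyOnWriteArrayList<>();

  void addSenders(List<String> toAdd) {
    senders.addAll(toAdd);
  }

  void removeSenders(List<String> toRemove) {
    senders.removeAll(toRemove);
  }

  /** Stops every sender present in one snapshot and returns the senders it stopped. */
  List<String> stopAll() {
    // A single snapshot: the array size and the loop bound both come from it,
    // so a concurrent removeSenders cannot make the index run past the list.
    final List<String> snapshot = new ArrayList<>(senders);
    final CompletableFuture<?>[] futures = new CompletableFuture<?>[snapshot.size()];
    for (int i = 0; i < futures.length; i++) {
      futures[i] = stopAsync(snapshot.get(i));
    }
    CompletableFuture.allOf(futures).join();
    return snapshot;
  }

  // Placeholder for stopping one sender asynchronously.
  private static CompletableFuture<String> stopAsync(String sender) {
    return CompletableFuture.completedFuture(sender);
  }
}
```

Senders added or removed after the snapshot is taken are not seen by that stopAll call, so the caller still has to decide how late additions are handled.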
--
This message was sent by Atlassian Jira
(v8.20.10#820010)