[ 
https://issues.apache.org/jira/browse/RATIS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuuka updated RATIS-2162:
-------------------------
    Description: 
This is the root cause of the problem reported in RATIS-2161
(Grpc may spawn many threads).

1. The old leader S receives a larger term number and converts to a follower.
2. LogAppender thread L does not receive the shutdown signal in time because a
restart was triggered abnormally.
3. S holds the ‘server’ lock and waits for L to shut down.
4. L triggers snapshot sending and calls newSnapshotRequests.
5. In newSnapshotRequests, L tries to acquire the ‘server’ lock.

This eventually leads to a deadlock: S holds the ‘server’ lock while waiting
for L, and L waits for the same lock. gRPC cannot reclaim the thread in time,
and then the problem described in RATIS-2161 occurs.
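
The lock-and-wait pattern in steps 3 to 5 can be reproduced in isolation. The
following is only a minimal sketch, not Ratis code: the class name, the method
appenderGotLock and the plain ReentrantLock standing in for the ‘server’ lock
are all assumptions made for illustration. The appender side uses a bounded
tryLock only so that the demo terminates; with a plain lock() it would block
forever.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the scenario; none of these names come from Ratis. */
class LockJoinDeadlockSketch {
  /**
   * Returns true if the appender thread acquired the lock while the closing
   * thread was holding it and waiting for the appender to finish.
   */
  static boolean appenderGotLock(long waitMillis) {
    final ReentrantLock serverLock = new ReentrantLock(); // stands in for the 'server' lock
    final CountDownLatch lockHeld = new CountDownLatch(1);
    final boolean[] acquired = {false};

    final Thread appender = new Thread(() -> {
      try {
        lockHeld.await(); // make sure S already holds the lock
        // Step 5: L needs the lock to build the snapshot request.
        if (serverLock.tryLock(waitMillis, TimeUnit.MILLISECONDS)) {
          try {
            acquired[0] = true;
          } finally {
            serverLock.unlock();
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    appender.start();

    serverLock.lock(); // step 3: S holds the lock ...
    try {
      lockHeld.countDown();
      appender.join(); // ... and waits for L to shut down
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      serverLock.unlock();
    }
    return acquired[0];
  }

  public static void main(String[] args) {
    System.out.println("appender acquired lock: " + appenderGotLock(200));
  }
}
```

In this sketch the appender never gets the lock, because the closing thread
keeps holding it for the whole duration of join().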

Timeline of closing LeaderState:

    ------+------------------------------------------------+------
          |                                                |
   shutdown LeaderState                            stop LogAppender L

Timeline of LogAppender L:

                        ------+---------------------+------
                              |                     |
                      restart logAppender   newInstallSnapshotRequests

I think a possible fix is to check the Raft server's role every time the
LogAppender is awakened, and to close the LogAppender if the server is no
longer the leader.
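
The proposed check could look roughly like the loop below. This is a sketch
under assumptions, not the actual LogAppender code: runAppender, isLeader,
awaitEvent and doWork are hypothetical names, and the role is modelled as a
simple BooleanSupplier. Note that such a check only narrows the window; the
role can still change between the check and the work that follows it.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical appender loop; the names are illustrative, not Ratis APIs. */
class AppenderLoopSketch {
  /**
   * Runs until the role check fails and returns the number of completed rounds.
   *
   * @param isLeader   hypothetical role check, evaluated after every wake-up
   * @param awaitEvent blocks until the appender is awakened
   * @param doWork     appends entries or sends a snapshot
   */
  static int runAppender(BooleanSupplier isLeader, Runnable awaitEvent, Runnable doWork) {
    int rounds = 0;
    while (true) {
      awaitEvent.run();
      if (!isLeader.getAsBoolean()) {
        // No longer the leader: exit before touching any server-side lock.
        return rounds;
      }
      doWork.run();
      rounds++;
    }
  }

  /** Simulates a server that stops being the leader after the third wake-up. */
  static int demo() {
    final AtomicInteger wakeups = new AtomicInteger();
    final AtomicInteger sent = new AtomicInteger();
    return runAppender(
        () -> wakeups.get() <= 3,   // leader only for the first three wake-ups
        wakeups::incrementAndGet,   // each call counts as one wake-up
        sent::incrementAndGet);     // placeholder for append / snapshot work
  }
}
```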

In LeaderStateImpl, there is another thread-safety issue regarding senderList:
removeSenders, addSenders and stopAll may be called concurrently by multiple
threads.

For example, thread t1 creates a futures array of size 3 in stopAll, and then
thread t2 calls removeSenders. This may cause an out-of-bounds access, because
the length of the futures array is 3 but senders.size() < 3.
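
One way to avoid this kind of race is to take the size and the elements from
the same snapshot of the list. The sketch below only illustrates that idea; it
is not the LeaderStateImpl code. The class SenderListSketch, the use of String
in place of a real sender type, and the interleaved hook (which simulates t2
running at the worst moment) are assumptions made for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical sender list; Strings stand in for the real sender objects. */
class SenderListSketch {
  private final List<String> senders = new CopyOnWriteArrayList<>();

  void addSender(String s) {
    senders.add(s);
  }

  void removeSender(String s) {
    senders.remove(s);
  }

  int size() {
    return senders.size();
  }

  /**
   * Builds one future per sender from a single snapshot of the list, so the
   * array length and the iterated elements can never disagree.
   *
   * @param interleaved simulates another thread modifying the list mid-call
   */
  CompletableFuture<?>[] stopAll(Runnable interleaved) {
    final Object[] snapshot = senders.toArray(); // one consistent copy
    final CompletableFuture<?>[] futures = new CompletableFuture<?>[snapshot.length];
    interleaved.run(); // e.g. t2 calling removeSender here
    for (int i = 0; i < snapshot.length; i++) {
      futures[i] = CompletableFuture.completedFuture("stopped " + snapshot[i]);
    }
    return futures;
  }

  /** Returns the futures array length when a sender is removed mid-call. */
  static int demo() {
    final SenderListSketch list = new SenderListSketch();
    list.addSender("a");
    list.addSender("b");
    list.addSender("c");
    return list.stopAll(() -> list.removeSender("c")).length;
  }
}
```

Alternatively, the three methods could be guarded by one common lock; the
snapshot approach simply avoids holding a lock while the senders are stopped.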

> When closing leaderState, if the logAppender thread sends a snapshot, a 
> deadlock may occur
> ------------------------------------------------------------------------------------------
>
>                 Key: RATIS-2162
>                 URL: https://issues.apache.org/jira/browse/RATIS-2162
>             Project: Ratis
>          Issue Type: Wish
>    Affects Versions: 3.1.0
>            Reporter: yuuka
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
