[ 
https://issues.apache.org/jira/browse/RATIS-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601837#comment-17601837
 ] 

Song Ziyang commented on RATIS-1699:
------------------------------------

[~szetszwo] Indeed the OOM error is the root cause for this cascading disaster 
and it is a corner case. Fortunately the RaftLog is closed and we don't have to 
worry for data lost. However, the leaked GrpcLogAppender thread  keeps 
restarting and printing out error logs, causing the JVM process unable to exit, 
which is confusing at first glance. 

> Data race condition detected in LeaderStateImpl
> -----------------------------------------------
>
>                 Key: RATIS-1699
>                 URL: https://issues.apache.org/jira/browse/RATIS-1699
>             Project: Ratis
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Song Ziyang
>            Priority: Major
>         Attachments: error_stack.log, related_log.log
>
>
> Hi, we observed abnormal behaviors of GrpcLogAppender in a recent run: It 
> keeps *restarting* and *encountering IllegalArgumentException* and 
> *restarting.* Please refer to the attachment error_stack.log for more details.
> It appears that the SegmentRaftLog is closed, but the GrpcLogAppender is 
> still trying to read the log, and thus cause the IllegalArgumentException.
> After carefully inspecting the related code, we came up with a possible event 
> sequence that lead to this *forever loop of restarting.*
>  
> The LeaderStateImpl method 
> restart([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L532|https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L532)])and
>  
> stop([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L326])
>  have no synchronization coordinations, so they may cause potential race 
> conditions.
> Consider the following event sequence:
>  # GrpcLogAppender encounters a OOM error, and transitions to EXCEPTION 
> state, then call LeaderStateImpl's restart() method to recover. 
> ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderDaemon.java#L90])
>  # StateMachine throws a OOM, which propagates to StateMachineUpdater, and 
> cause the RaftServerImpl to close. 
> ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/StateMachineUpdater.java#L197])
>  # The RaftServerImpl.close in turn calls shutdownLeaderState and closes all 
> logAppenders. 
> ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L329])
>  # When step 3 returns, RaftServerImpl assumes all logAppenders are closed, 
> thus calls SegmentRaftLog.close() to safely close the Raft Log. 
> ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L416])
>  # *{color:#de350b}Here comes the data race.{color}* The restart() in step 1 
> keeps executing in parallel and starts a new LogAppender. This leads to a 
> conflict: RaftServerimpl thinks it closes all logAppenders, but there is one 
> leaking out. When the newly restarted LogAppender tries to read RaftLog, it 
> encountered IllegalArgumentException, as the RaftLog is already closed by 
> RaftServerImpl. *It then transitions to EXCEPTION and restarts again.*
>  # Step 5 loops over and over again, leading to overwhelming error logs and 
> system livelocks.
> To confirm this, we read the log throughly again. The existing logs shows a 
> similar event sequence. Please refer to related_log.log
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to