[ 
https://issues.apache.org/jira/browse/RATIS-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Song Ziyang resolved RATIS-1699.
--------------------------------
    Resolution: Not A Problem

This error occurs rarely. It is more of a cascading error of OOM.

> Data race condition detected in LeaderStateImpl
> -----------------------------------------------
>
>                 Key: RATIS-1699
>                 URL: https://issues.apache.org/jira/browse/RATIS-1699
>             Project: Ratis
>          Issue Type: Bug
>    Affects Versions: 2.3.0
>            Reporter: Song Ziyang
>            Priority: Major
>         Attachments: error_stack.log, related_log.log
>
>
> Hi, we observed abnormal behaviors of GrpcLogAppender in a recent run: It 
> keeps *restarting* and *encountering IllegalArgumentException* and 
> *restarting.* Please refer to the attachment error_stack.log for more details.
> It appears that the SegmentRaftLog is closed, but the GrpcLogAppender is 
> still trying to read the log, and thus cause the IllegalArgumentException.
> After carefully inspecting the related code, we came up with a possible event 
> sequence that lead to this *forever loop of restarting.*
>  
> The LeaderStateImpl method 
> restart([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L532|https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L532)])and
>  
> stop([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L326])
>  have no synchronization coordinations, so they may cause potential race 
> conditions.
> Consider the following event sequence:
>  # GrpcLogAppender encounters a OOM error, and transitions to EXCEPTION 
> state, then call LeaderStateImpl's restart() method to recover. 
> ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderDaemon.java#L90])
>  # StateMachine throws a OOM, which propagates to StateMachineUpdater, and 
> cause the RaftServerImpl to close. 
> ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/StateMachineUpdater.java#L197])
>  # The RaftServerImpl.close in turn calls shutdownLeaderState and closes all 
> logAppenders. 
> ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L329])
>  # When step 3 returns, RaftServerImpl assumes all logAppenders are closed, 
> thus calls SegmentRaftLog.close() to safely close the Raft Log. 
> ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L416])
>  # *{color:#de350b}Here comes the data race.{color}* The restart() in step 1 
> keeps executing in parallel and starts a new LogAppender. This leads to a 
> conflict: RaftServerimpl thinks it closes all logAppenders, but there is one 
> leaking out. When the newly restarted LogAppender tries to read RaftLog, it 
> encountered IllegalArgumentException, as the RaftLog is already closed by 
> RaftServerImpl. *It then transitions to EXCEPTION and restarts again.*
>  # Step 5 loops over and over again, leading to overwhelming error logs and 
> system livelocks.
> To confirm this, we read the log throughly again. The existing logs shows a 
> similar event sequence. Please refer to related_log.log
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to