Song Ziyang created RATIS-1699:
----------------------------------

             Summary: Data race condition detected in LeaderStateImpl
                 Key: RATIS-1699
                 URL: https://issues.apache.org/jira/browse/RATIS-1699
             Project: Ratis
          Issue Type: Bug
    Affects Versions: 2.3.0
            Reporter: Song Ziyang
         Attachments: error_stack.log, related_log.log

Hi, we observed abnormal behaviors of GrpcLogAppender in a recent run: It keeps 
*restarting* and *encountering IllegalArgumentException* and *restarting.* 
Please refer to the attachment error_stack.log for more details.

It appears that the SegmentRaftLog is closed, but the GrpcLogAppender is still 
trying to read the log, and thus cause the IllegalArgumentException.

After carefully inspecting the related code, we came up with a possible event 
sequence that lead to this *forever loop of restarting.*

 

The LeaderStateImpl method 
restart([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L532|https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L532)])and
 
stop([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L326])
 have no synchronization coordinations, so they may cause potential race 
conditions.

Consider the following event sequence:
 # GrpcLogAppender encounters a OOM error, and transitions to EXCEPTION state, 
then call LeaderStateImpl's restart() method to recover. 
([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderDaemon.java#L90])
 # StateMachine throws a OOM, which propagates to StateMachineUpdater, and 
cause the RaftServerImpl to close. 
([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/StateMachineUpdater.java#L197])
 # The RaftServerImpl.close in turn calls shutdownLeaderState and closes all 
logAppenders. 
([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L329])
 # When step 3 returns, RaftServerImpl assumes all logAppenders are closed, 
thus calls SegmentRaftLog.close() to safely close the Raft Log. 
([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L416])
 # *{color:#de350b}Here comes the data race.{color}* The restart() in step 1 
keeps executing in parallel and starts a new LogAppender. This leads to a 
conflict: RaftServerimpl thinks it closes all logAppenders, but there is one 
leaking out. When the newly restarted LogAppender tries to read RaftLog, it 
encountered IllegalArgumentException, as the RaftLog is already closed by 
RaftServerImpl. *It then transitions to EXCEPTION and restarts again.*
 # Step 5 loops over and over again, leading to overwhelming error logs and 
system livelocks.

To confirm this, we read the log throughly again. The existing logs shows a 
similar event sequence. Please refer to related_log.log

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to