[
https://issues.apache.org/jira/browse/RATIS-1699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Song Ziyang resolved RATIS-1699.
--------------------------------
Resolution: Not A Problem
This error occurs rarely. It is more of a cascading error of OOM.
> Data race condition detected in LeaderStateImpl
> -----------------------------------------------
>
> Key: RATIS-1699
> URL: https://issues.apache.org/jira/browse/RATIS-1699
> Project: Ratis
> Issue Type: Bug
> Affects Versions: 2.3.0
> Reporter: Song Ziyang
> Priority: Major
> Attachments: error_stack.log, related_log.log
>
>
> Hi, we observed abnormal behaviors of GrpcLogAppender in a recent run: It
> keeps *restarting* and *encountering IllegalArgumentException* and
> *restarting.* Please refer to the attachment error_stack.log for more details.
> It appears that the SegmentRaftLog is closed, but the GrpcLogAppender is
> still trying to read the log, and thus cause the IllegalArgumentException.
> After carefully inspecting the related code, we came up with a possible event
> sequence that lead to this *forever loop of restarting.*
>
> The LeaderStateImpl method
> restart([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L532|https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L532)])and
>
> stop([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L326])
> have no synchronization coordinations, so they may cause potential race
> conditions.
> Consider the following event sequence:
> # GrpcLogAppender encounters a OOM error, and transitions to EXCEPTION
> state, then call LeaderStateImpl's restart() method to recover.
> ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/leader/LogAppenderDaemon.java#L90])
> # StateMachine throws a OOM, which propagates to StateMachineUpdater, and
> cause the RaftServerImpl to close.
> ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/StateMachineUpdater.java#L197])
> # The RaftServerImpl.close in turn calls shutdownLeaderState and closes all
> logAppenders.
> ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L329])
> # When step 3 returns, RaftServerImpl assumes all logAppenders are closed,
> thus calls SegmentRaftLog.close() to safely close the Raft Log.
> ([https://github.com/apache/ratis/blob/117b333114daf37e3d7a31919e49f8fdc0d9e201/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L416])
> # *{color:#de350b}Here comes the data race.{color}* The restart() in step 1
> keeps executing in parallel and starts a new LogAppender. This leads to a
> conflict: RaftServerimpl thinks it closes all logAppenders, but there is one
> leaking out. When the newly restarted LogAppender tries to read RaftLog, it
> encountered IllegalArgumentException, as the RaftLog is already closed by
> RaftServerImpl. *It then transitions to EXCEPTION and restarts again.*
> # Step 5 loops over and over again, leading to overwhelming error logs and
> system livelocks.
> To confirm this, we read the log throughly again. The existing logs shows a
> similar event sequence. Please refer to related_log.log
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)