[
https://issues.apache.org/jira/browse/RATIS-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tsz-wo Sze resolved RATIS-2019.
-------------------------------
Fix Version/s: 3.1.0
(was: 3.0.1)
Resolution: Fixed
The pull request is now merged. Thanks, [~tanxinyu]!
> Fixed abnormal exit of StateMachineUpdater
> ------------------------------------------
>
> Key: RATIS-2019
> URL: https://issues.apache.org/jira/browse/RATIS-2019
> Project: Ratis
> Issue Type: Bug
> Reporter: Xinyu Tan
> Assignee: Xinyu Tan
> Priority: Major
> Fix For: 3.1.0
>
> Attachments: image-2024-01-29-11-36-17-263.png, screenshot-1.png,
> screenshot-2.png
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> When Ratis is restarted, startup occasionally fails with an error.
> For case 1
> !image-2024-01-29-11-36-17-263.png!
>
>
> Looking through the code, I found that the problem is
> [here|https://github.com/apache/ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L429].
> StateMachineUpdater calls this line when applying any member change log
> entry from a previous term if the Leader exists, but the startupEntry for
> the current term may not have been initialized yet, so the assertion throws
> an error. We should only fire this assertion if the log entry's term matches
> the current term.
> In addition, I found that the current implementation triggers
> notifyLeaderReady several times while applying the member change log entries
> of the current term. This is inconsistent with the semantics of this
> interface, because the Leader is already in the ready state.
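
The two points of case 1 can be illustrated with a minimal sketch. This is not the Ratis code: LeaderReadySketch, onEntryApplied, and the index fields are hypothetical names, and a counter stands in for the notifyLeaderReady call. It only shows the idea of checking the startup entry for current-term entries alone and notifying at most once.

```java
/**
 * Hypothetical, simplified stand-in for a leader's readiness tracking.
 * All names here are illustrative; this is not the Ratis LeaderStateImpl.
 */
final class LeaderReadySketch {
  private final long currentTerm;
  private long startupEntryIndex = -1; // -1: startup entry not initialized yet
  private boolean ready;
  private int readyNotifications;

  LeaderReadySketch(long currentTerm) {
    this.currentTerm = currentTerm;
  }

  void setStartupEntryIndex(long index) {
    this.startupEntryIndex = index;
  }

  int getReadyNotifications() {
    return readyNotifications;
  }

  /** Returns true only for the single call that moves the leader to ready. */
  boolean onEntryApplied(long entryTerm, long entryIndex) {
    if (entryTerm != currentTerm) {
      // Entry from a previous term: the startup entry of the current term
      // may not exist yet, so nothing is asserted about it.
      return false;
    }
    if (ready) {
      return false; // already ready: do not notify again
    }
    if (startupEntryIndex < 0) {
      // Only a current-term entry can reach this check.
      throw new IllegalStateException("startup entry is not initialized");
    }
    if (entryIndex < startupEntryIndex) {
      return false; // startup entry itself has not been applied yet
    }
    ready = true;
    readyNotifications++; // stands in for one notifyLeaderReady call
    return true;
  }
}

final class LeaderReadyDemo {
  public static void main(String[] args) {
    LeaderReadySketch leader = new LeaderReadySketch(5);
    leader.onEntryApplied(4, 10);    // previous-term entry: ignored, no exception
    leader.setStartupEntryIndex(11);
    leader.onEntryApplied(5, 11);    // startup entry applied: leader becomes ready
    leader.onEntryApplied(5, 12);    // later current-term entry: no second notification
    System.out.println("notifications = " + leader.getReadyNotifications());
  }
}
```

In this sketch a missing startup entry is still treated as an error, but only when the applied entry belongs to the current term.
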
> For case 2
> !screenshot-2.png!
> !screenshot-1.png!
> I noticed that StateMachineUpdater fetching leaderState and the server's
> changeToFollower run asynchronously. As shown in the log,
> StateMachineUpdater gets the leaderStateImpl with term 179 during execution
> and runs checkReady. Meanwhile, the server receives log requests with a
> larger term, updates the term to 180, and sets leaderStateImpl to null.
> StateMachineUpdater then runs getCurrentTerm on the leaderStateImpl with
> term 179, which fails. In this case, we should get the latest term directly
> from server.getState().getCurrentTerm() so that we don't get this error.
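
The interleaving in case 2 can be replayed in a small single-threaded sketch. The classes below (ServerStateSketch, LeaderSnapshotSketch) are hypothetical stand-ins, not Ratis classes, and the term check inside the stale leader object is an assumption made to mirror the failure described above. The sketch shows why reading the term from the server state is safe, while reading it through a leader object captured earlier is not.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for the server's state; not the Ratis API. */
final class ServerStateSketch {
  private final AtomicLong currentTerm;
  private final AtomicReference<LeaderSnapshotSketch> leader = new AtomicReference<>();

  ServerStateSketch(long term) {
    this.currentTerm = new AtomicLong(term);
  }

  long getCurrentTerm() {
    return currentTerm.get();
  }

  /** May return null once the server has stepped down. */
  LeaderSnapshotSketch getLeader() {
    return leader.get();
  }

  void becomeLeader() {
    leader.set(new LeaderSnapshotSketch(this, currentTerm.get()));
  }

  /** Models stepping down after seeing a larger term. */
  void changeToFollower(long newTerm) {
    currentTerm.set(newTerm);
    leader.set(null);
  }
}

/** Hypothetical leader object bound to the term in which it was created. */
final class LeaderSnapshotSketch {
  private final ServerStateSketch server;
  private final long term;

  LeaderSnapshotSketch(ServerStateSketch server, long term) {
    this.server = server;
    this.term = term;
  }

  /** Assumed behavior: fails if the server has moved on to another term. */
  long getCurrentTerm() {
    long serverTerm = server.getCurrentTerm();
    if (serverTerm != term) {
      throw new IllegalStateException(
          "leader term " + term + " != server term " + serverTerm);
    }
    return term;
  }
}

final class StaleLeaderDemo {
  public static void main(String[] args) {
    ServerStateSketch server = new ServerStateSketch(179);
    server.becomeLeader();
    LeaderSnapshotSketch captured = server.getLeader(); // updater grabs the leader
    server.changeToFollower(180);                       // concurrent step-down
    try {
      captured.getCurrentTerm();                        // fragile: stale object
    } catch (IllegalStateException e) {
      System.out.println("stale leader: " + e.getMessage());
    }
    System.out.println("server term: " + server.getCurrentTerm()); // robust read
  }
}
```
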