[ 
https://issues.apache.org/jira/browse/RATIS-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tsz-wo Sze resolved RATIS-2019.
-------------------------------
    Fix Version/s: 3.1.0
                       (was: 3.0.1)
       Resolution: Fixed

The pull request is now merged. Thanks, [~tanxinyu]!

> Fixed abnormal exit of StateMachineUpdater
> ------------------------------------------
>
>                 Key: RATIS-2019
>                 URL: https://issues.apache.org/jira/browse/RATIS-2019
>             Project: Ratis
>          Issue Type: Bug
>            Reporter: Xinyu Tan
>            Assignee: Xinyu Tan
>            Priority: Major
>             Fix For: 3.1.0
>
>         Attachments: image-2024-01-29-11-36-17-263.png, screenshot-1.png, 
> screenshot-2.png
>
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> In some scenarios where Ratis is restarted, an error occasionally occurs 
> at startup. We have observed the two cases below.
> For case 1
> !image-2024-01-29-11-36-17-263.png!
>  
>  
> Looking through the code, I found a problem 
> [here|https://github.com/apache/ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L429].
> StateMachineUpdater calls this line whenever it applies a member change log 
> entry from a previous term while a Leader exists. However, the startupEntry 
> for the current term may not have been initialized yet, so the assertion 
> throws an error. The assertion should fire only if the log entry's term 
> matches the current term.
> In addition, the current implementation triggers notifyLeaderReady several 
> times while applying member change log entries of the current term. This is 
> inconsistent with the semantics of the interface, because the Leader is 
> already in the ready state.
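
To make the case 1 reasoning concrete, here is a minimal sketch of the two guards described above: skip the startup-entry check for entries from an older term, and notify readiness at most once. It is not the merged patch. LeaderReadySketch, onConfEntryApplied, setStartupIndex and readyNotifications are hypothetical names invented for illustration, and the readiness condition is a simplification.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified model of the case 1 guards; not the actual Ratis classes. */
class LeaderReadySketch {
  private final long currentTerm;
  // Index of the startup entry for the current term; -1 until it has been initialized.
  private long startupIndex = -1;
  private boolean ready = false;
  // Records each readiness notification so that duplicates are easy to observe.
  final List<Long> readyNotifications = new ArrayList<>();

  LeaderReadySketch(long currentTerm) {
    this.currentTerm = currentTerm;
  }

  void setStartupIndex(long index) {
    this.startupIndex = index;
  }

  /** Called when a member change entry is applied while this node is the leader. */
  void onConfEntryApplied(long entryTerm, long entryIndex) {
    if (entryTerm != currentTerm) {
      // Entry from a previous term: the startup entry may not exist yet, so skip the check.
      return;
    }
    if (startupIndex < 0) {
      throw new IllegalStateException("startup entry not initialized for term " + currentTerm);
    }
    if (!ready && entryIndex >= startupIndex) {
      // Notify at most once for this leadership.
      ready = true;
      readyNotifications.add(entryIndex);
    }
  }
}
```

With these guards, an old-term entry applied before the startup entry exists is ignored instead of tripping the check, and later entries of the current term do not notify again.
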
> For case 2
>  !screenshot-2.png! 
>  !screenshot-1.png! 
> I noticed that StateMachineUpdater fetching leaderState and the RaftServer 
> running changeToFollower happen asynchronously. As shown in the log, 
> StateMachineUpdater obtains the leaderStateImpl with term 179 and runs 
> checkReady. Meanwhile, the server receives a log request with a larger 
> term, updates its term to 180, and sets leaderStateImpl to null. 
> checkReady then calls getCurrentTerm through the leaderStateImpl of term 
> 179, which fails. In this case, we should get the latest term directly 
> from server.getState().getCurrentTerm() to avoid this error.
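
The case 2 interleaving can be modelled with a small sketch. It assumes only what the description states: the updater captures a leader state object, and a step-down then raises the term and clears the reference. TermLookupSketch, LeaderState, becomeLeader and entryIsFromCurrentTerm are hypothetical names, not Ratis APIs. The point is that the term is read from the server-level state, which stays valid after the leader state is discarded.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical model of the case 2 race; names are illustrative, not Ratis APIs. */
class TermLookupSketch {
  /** Leader-only state, created on election and discarded on step-down. */
  static final class LeaderState {
    final long term;

    LeaderState(long term) {
      this.term = term;
    }
  }

  private final AtomicLong currentTerm = new AtomicLong();
  private final AtomicReference<LeaderState> leaderState = new AtomicReference<>();

  void becomeLeader(long term) {
    currentTerm.set(term);
    leaderState.set(new LeaderState(term));
  }

  /** Models the step-down: a request with a larger term arrives. */
  void changeToFollower(long newTerm) {
    currentTerm.set(newTerm);
    leaderState.set(null);
  }

  LeaderState getLeaderState() {
    return leaderState.get();
  }

  /** Always reflects the latest term, whether or not this node is still the leader. */
  long getCurrentTerm() {
    return currentTerm.get();
  }

  /** Compares the entry term with the server's term, not with a captured leader state. */
  boolean entryIsFromCurrentTerm(long entryTerm) {
    return entryTerm == getCurrentTerm();
  }
}
```

Replaying the sequence from the log: after becomeLeader(179), a caller captures getLeaderState(); changeToFollower(180) then clears it. The captured object still reports term 179, while getCurrentTerm() reports 180, so a check based on the server-level term does not depend on the stale leader state.
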



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
