[
https://issues.apache.org/jira/browse/RATIS-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xinyu Tan updated RATIS-2019:
-----------------------------
Description:
When Ratis is restarted, the server occasionally fails with an error at
startup.
Case 1:
!image-2024-01-29-11-36-17-263.png!
Reading through the code, I found a problem
[here|https://github.com/apache/ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L429].
If a leader exists, StateMachineUpdater calls this line when applying any
membership change log entry from a previous term. At that point the
startupEntry for the current term may not have been initialized yet, so the
assertion throws an error. We should only fire this assertion if the log entry
belongs to the current term.
In addition, the current implementation triggers notifyLeaderReady several
times for membership change log entries of the current term. This is
inconsistent with the semantics of the interface, because the leader is
already in the ready state.
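The guard proposed for case 1 can be sketched roughly as follows. This is a simplified model, not the Ratis code: the class LeaderReadinessSketch and its methods setStartupIndex, onApplied and getReadyNotifications are hypothetical names invented for illustration. It skips entries from earlier terms before touching the startup entry, and counts a "ready" notification at most once.

```java
import java.util.OptionalLong;

/** Hypothetical model of a leader's readiness check; NOT the Ratis LeaderStateImpl. */
final class LeaderReadinessSketch {
  private final long currentTerm;
  // Empty until the startup entry of the current term has been created.
  private OptionalLong startupIndex = OptionalLong.empty();
  private int readyNotifications = 0;

  LeaderReadinessSketch(long currentTerm) {
    this.currentTerm = currentTerm;
  }

  void setStartupIndex(long index) {
    startupIndex = OptionalLong.of(index);
  }

  /** Called (in this model) after a log entry has been applied. */
  void onApplied(long entryTerm, long entryIndex) {
    if (entryTerm != currentTerm) {
      return; // entry from a previous term: the startup entry may not exist yet
    }
    if (!startupIndex.isPresent()) {
      throw new IllegalStateException("startup entry not initialized for term " + currentTerm);
    }
    // Notify only once, when the startup entry itself has been applied.
    if (readyNotifications == 0 && entryIndex == startupIndex.getAsLong()) {
      readyNotifications++;
    }
  }

  int getReadyNotifications() {
    return readyNotifications;
  }

  public static void main(String[] args) {
    LeaderReadinessSketch leader = new LeaderReadinessSketch(5);
    leader.onApplied(4, 10); // previous-term entry, startup entry still missing: ignored
    leader.setStartupIndex(11);
    leader.onApplied(5, 11); // startup entry applied: one notification
    leader.onApplied(5, 12); // later entry of the same term: no further notification
    System.out.println("ready notifications: " + leader.getReadyNotifications());
  }
}
```

Checking the term first means the initialization assertion is only reached for entries that could belong to the current leadership.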
Case 2:
!screenshot-1.png!
!screenshot-2.png!
StateMachineUpdater fetching leaderState and the Raft server's
changeToFollower run asynchronously. As shown in the log, StateMachineUpdater
gets a leaderStateImpl with term 179 and starts executing checkReady.
Meanwhile, the server receives a log request with a larger term, updates its
term to 180, and sets leaderStateImpl to null. StateMachineUpdater then runs
getCurrentTerm on the leaderStateImpl with term 179, which fails. In this
case, we should get the latest term directly from
server.getState().getCurrentTerm() to avoid this error.
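The interleaving in case 2 can be reproduced in a small single-threaded model. Everything here is hypothetical (the classes Server and Leader, and the methods isCurrentTermEntryUnsafe and isCurrentTermEntry); in particular, it assumes the error comes from a consistency check performed when the term is read through the stale leader object, which the report does not spell out. The sketch contrasts that read with one taken directly from the server state.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical model of the stale-leader race; NOT the Ratis classes. */
final class StaleLeaderTermSketch {
  static final class Server {
    private final AtomicLong currentTerm;
    private final AtomicReference<Leader> leader = new AtomicReference<>();

    Server(long term) {
      currentTerm = new AtomicLong(term);
      leader.set(new Leader(this, term));
    }

    long getCurrentTerm() {
      return currentTerm.get();
    }

    Leader getLeader() {
      return leader.get();
    }

    /** Step down: adopt the larger term and drop the leader state. */
    void changeToFollower(long newTerm) {
      currentTerm.set(newTerm);
      leader.set(null);
    }
  }

  static final class Leader {
    private final Server server;
    private final long term;

    Leader(Server server, long term) {
      this.server = server;
      this.term = term;
    }

    /** Assumed behavior: reading the term via a stale leader object fails. */
    long getCurrentTerm() {
      if (term != server.getCurrentTerm()) {
        throw new IllegalStateException("stale leader term " + term);
      }
      return term;
    }

    boolean isCurrentTermEntryUnsafe(long entryTerm) {
      return entryTerm == getCurrentTerm(); // may throw after a step-down
    }

    boolean isCurrentTermEntry(long entryTerm) {
      return entryTerm == server.getCurrentTerm(); // always reads the latest term
    }
  }

  public static void main(String[] args) {
    Server server = new Server(179);
    Leader snapshot = server.getLeader(); // the updater grabs the leader state
    server.changeToFollower(180);         // a larger term arrives in between
    System.out.println("matches current term: " + snapshot.isCurrentTermEntry(179));
  }
}
```

Reading the term from the server makes the check degrade to "not a current-term entry" after a step-down instead of raising an error.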
was:
When Ratis is restarted, the server occasionally fails with an error at
startup.
!image-2024-01-29-11-36-17-263.png!
Reading through the code, I found a problem
[here|https://github.com/apache/ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/LeaderStateImpl.java#L429].
If a leader exists, StateMachineUpdater calls this line when applying any
membership change log entry from a previous term. At that point the
startupEntry for the current term may not have been initialized yet, so the
assertion throws an error. We should only fire this assertion if the log entry
belongs to the current term.
In addition, the current implementation triggers notifyLeaderReady several
times for membership change log entries of the current term. This is
inconsistent with the semantics of the interface, because the leader is
already in the ready state.
> Fixed abnormal exit of StateMachineUpdater
> ------------------------------------------
>
> Key: RATIS-2019
> URL: https://issues.apache.org/jira/browse/RATIS-2019
> Project: Ratis
> Issue Type: Bug
> Reporter: Xinyu Tan
> Assignee: Xinyu Tan
> Priority: Major
> Fix For: 3.0.1
>
> Attachments: image-2024-01-29-11-36-17-263.png, screenshot-1.png,
> screenshot-2.png
>
> Time Spent: 40m
> Remaining Estimate: 0h