Hi Snehasish,

Since you already have a test, could you share the code change? You may attach a patch file or create a pull request. I will run it to reproduce the failure.
In the meantime, I will try to understand the details you provided.

Tsz-Wo

On Thu, Mar 5, 2026 at 3:14 AM Snehasish Roy <[email protected]> wrote:

> Hi Tsz-Wo,
>
> Thank you for your prompt response. I was able to reproduce this issue
> using CounterStateMachine.
>
> I added a utility in the CounterClient to trigger a snapshot.
>
> ```
> private void takeSnapshot() throws IOException {
>   RaftClientReply raftClientReply = client.getSnapshotManagementApi()
>       .create(true, 30_000);
>   System.out.println(raftClientReply);
> }
> ```
>
> Once the snapshot is triggered, I move it to a different directory to
> simulate clean restart.
>
> I also updated the SimpleStateMachineStorage::loadLatestSnapshot() to look
> for snapshots in a different directory.
>
> ```
> public SingleFileSnapshotInfo loadLatestSnapshot() {
>   final File dir = new File("/tmp/snapshots");
> }
> ```
>
> Full steps for reproduction
>
> 1. I started a 3 Node CounterServer and performed some updates to the state
>    machine using the CounterClient.
>
> 2. Triggered the snapshot via the CounterClient and then moved the snapshot
>    to a different directory - the snapshot will be of the format term_index.
>    Here the term will initially be 1, and let's assume the index is at 10.
>
> 3. Kill the leader, the term would have increased to 2.
>
> 4. Perform some updates and trigger another snapshot. Let's assume the
>    index is at 20 and the term is at 2. Moved the snapshot to a different
>    directory.
>
> 5. Stopped all nodes. Cleared all storage directories of all the nodes to
>    simulate clean restart.
>
> 6. Start 3 node CounterServer and observe the failure at the startup.
>
> ```
> 2026-03-05 15:48:56 INFO SimpleStateMachineStorage:229 - Latest snapshot is SingleFileSnapshotInfo(t:2, i:20):[/tmp/snapshots/snapshot.2_20] in /tmp/snapshots
> 2026-03-05 15:48:56 INFO SimpleStateMachineStorage:229 - Latest snapshot is SingleFileSnapshotInfo(t:2, i:20):[/tmp/snapshots/snapshot.2_20] in /tmp/snapshots
> 2026-03-05 15:48:56 INFO RaftServerConfigKeys:62 - raft.server.log.use.memory = false (default)
> 2026-03-05 15:48:56 INFO RaftServer$Division:155 - n0@group-ABB3109A44C1: getLatestSnapshot(CounterStateMachine-1:n0:group-ABB3109A44C1) returns SingleFileSnapshotInfo(t:2, i:20):[/tmp/snapshots/snapshot.2_20]
> 2026-03-05 15:48:56 INFO RaftLog:90 - n0@group-ABB3109A44C1-SegmentedRaftLog: snapshotIndexFromStateMachine = 20
> ....
> 2026-03-05 15:49:02 INFO RaftServer$Division:577 - n1@group-ABB3109A44C1: set firstElectionSinceStartup to false for becomeLeader
> 2026-03-05 15:49:02 INFO RaftServer$Division:278 - n1@group-ABB3109A44C1: change Leader from null to n1 at term 1 for becomeLeader, leader elected after 672ms
> 2026-03-05 15:49:02 INFO SegmentedRaftLogWorker:440 - n1@group-ABB3109A44C1-SegmentedRaftLogWorker: Starting segment from index:21
> 2026-03-05 15:49:02 INFO SegmentedRaftLogWorker:647 - n1@group-ABB3109A44C1-SegmentedRaftLogWorker: created new log segment /ratis/./n1/02511d47-d67c-49a3-9011-abb3109a44c1/current/log_inprogress_21
> ....
> 2026-03-05 15:49:02 INFO RaftServer$Division:309 - Leader n1@group-ABB3109A44C1-LeaderStateImpl is ready since appliedIndex == startIndex == 21
> 2026-03-05 15:49:02 ERROR StateMachineUpdater:207 - n1@group-ABB3109A44C1-StateMachineUpdater caught a Throwable.
> 2026-03-05 15:49:02 ERROR StateMachineUpdater:207 - n1@group-ABB3109A44C1-StateMachineUpdater caught a Throwable.
> java.lang.IllegalStateException: n1: Failed updateLastAppliedTermIndex: newTI = (t:1, i:21) < oldTI = (t:2, i:20)
>     at org.apache.ratis.util.Preconditions.assertTrue(Preconditions.java:77)
>     at org.apache.ratis.statemachine.impl.BaseStateMachine.updateLastAppliedTermIndex(BaseStateMachine.java:148)
>     at org.apache.ratis.statemachine.impl.BaseStateMachine.updateLastAppliedTermIndex(BaseStateMachine.java:139)
>     at org.apache.ratis.statemachine.impl.BaseStateMachine.notifyTermIndexUpdated(BaseStateMachine.java:135)
>     at org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1893)
>     at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:255)
>     at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:194)
>     at java.base/java.lang.Thread.run(Thread.java:1575)
> 2026-03-05 15:49:02 INFO RaftServer$Division:528 - n1@group-ABB3109A44C1: shutdown
> ```
>
> As you can see from the stack trace, during the snapshot restore, the
> termIndex was updated to the latest value seen from the snapshot 2:20, but
> when the server was started from a clean slate, then the term was reset to
> 1 by the RaftServerImpl at the startup. It then tries to update the log
> entries and fails because of the precondition check that the term should be
> monotonically increasing in the log entries.
>
> Please let me know if you need more information.
>
> Regards
>
> On Wed, 4 Mar 2026 at 06:33, Tsz Wo Sze <[email protected]> wrote:
>
> > Hi Snehasish,
> >
> > > ... newTI = (t:1, i:21) ...
> >
> > The newTI was invalid. It probably was from the state machine. It should
> > just use the TermIndex from LogEntryProto. See CounterStateMachine [1] as
> > an example.
> >
> > Tsz-Wo
> >
> > [1] https://github.com/apache/ratis/blob/3d9f5af376409de7e635bb67c7dfbeadc882c413/ratis-examples/src/main/java/org/apache/ratis/examples/counter/server/CounterStateMachine.java#L263-L266
> >
> > On Tue, Mar 3, 2026 at 10:52 AM Snehasish Roy via dev <[email protected]> wrote:
> >
> > > Hello everyone,
> > >
> > > I was exploring the snapshot restore capability of Ratis and found one
> > > scenario that failed.
> > >
> > > 1. Start a 3 Node ratis cluster and perform some updates to the state
> > >    machine.
> > > 2. Take the snapshot - the snapshot will be of the format term_index.
> > >    Here the term will initially be 1, and let's assume the index is at 10.
> > > 3. Kill the leader, the term would have increased to 2.
> > > 4. Perform some updates and trigger another snapshot. Let's assume the
> > >    index is at 20 and term is at 2.
> > > 5. Stop all nodes.
> > > 6. A failure is observed while starting the node.
> > >
> > > ```
> > > Failed updateLastAppliedTermIndex: newTI = (t:1, i:21) < oldTI = (t:2, i:20)
> > > ```
> > >
> > > Based on the error logs, I suspect the state machine updated the last
> > > applied term index to t:2, i:20, but the ServerState has a separate
> > > variable for tracking the currentTerm which is initialized to 0 at startup.
> > > Once the leader is elected, it tried to update the log entry but the update
> > > failed due to precondition check.
> > >
> > > What's the correct way to solve this problem? Should the term be reset to 0
> > > while loading the snapshot at the server startup?
> > >
> > > References:
> > >
> > > https://github.com/apache/ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/ServerState.java#L82
> > >
> > > https://github.com/apache/ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/statemachine/impl/BaseStateMachine.java#L138
> > >
> > > Thank you for looking into this issue.
> > >
> > > Regards,
> > > Snehasish
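
A note on the ordering behind the failed check quoted in this thread: the message "newTI = (t:1, i:21) < oldTI = (t:2, i:20)" is consistent with (term, index) pairs being compared by term first and by index only when the terms are equal, so a higher index cannot compensate for a lower term. The following is a minimal, self-contained Java sketch under that assumption. The class TermIndexOrderingSketch, its nested TermIndex class, and the updateLastApplied method are hypothetical stand-ins written for illustration; they are not the Ratis implementation.

```java
final class TermIndexOrderingSketch {

  /** Hypothetical stand-in for a (term, index) pair; NOT the Ratis TermIndex class. */
  static final class TermIndex implements Comparable<TermIndex> {
    final long term;
    final long index;

    TermIndex(long term, long index) {
      this.term = term;
      this.index = index;
    }

    /** Assumed ordering: compare terms first, then indices when the terms are equal. */
    @Override
    public int compareTo(TermIndex that) {
      final int byTerm = Long.compare(this.term, that.term);
      return byTerm != 0 ? byTerm : Long.compare(this.index, that.index);
    }

    @Override
    public String toString() {
      return "(t:" + term + ", i:" + index + ")";
    }
  }

  /** Hypothetical monotonicity check: reject a new value that is smaller than the old one. */
  static TermIndex updateLastApplied(TermIndex oldTI, TermIndex newTI) {
    if (newTI.compareTo(oldTI) < 0) {
      throw new IllegalStateException(
          "Failed updateLastAppliedTermIndex: newTI = " + newTI + " < oldTI = " + oldTI);
    }
    return newTI;
  }

  public static void main(String[] args) {
    // Values taken from the thread: the snapshot was loaded at term 2, index 20.
    final TermIndex fromSnapshot = new TermIndex(2, 20);
    // After the clean restart, the first new log entry was written at term 1, index 21.
    final TermIndex afterRestart = new TermIndex(1, 21);

    try {
      updateLastApplied(fromSnapshot, afterRestart);
    } catch (IllegalStateException e) {
      // Rejected: term 1 < term 2, even though index 21 > index 20.
      System.out.println(e.getMessage());
    }

    // An entry at the same term with a higher index passes the check.
    System.out.println(updateLastApplied(fromSnapshot, new TermIndex(2, 21)));
  }
}
```

Under this ordering, the entry at (t:1, i:21) is rejected purely because of its term, which matches the observation in the thread that the term restarted from a low value while the snapshot still carried term 2.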
