leixm commented on PR #2547: URL: https://github.com/apache/celeborn/pull/2547#issuecomment-2154054970
``` follower log 24/05/23 03:08:02,421 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: Reinitializing state machine. 24/05/23 03:08:02,421 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: obsolete snapshot provided: (t:28, i:1185003348) 24/05/23 03:06:53,439 WARN [master-state-machine-executor] AbstractMetaManager: Celeborn cluster estimated partition size changed from 129.2 MiB to 110.5 MiB 24/05/23 03:08:01,425 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: Taking a snapshot to file /mnt/ssd/0/celeborn-ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/raft_snapshot_1716404881416_5912151569428133017.dat. 24/05/23 03:08:01,428 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: Saving digest 184db806c56d5780f92c98fd7d3fb35a for snapshot file /mnt/ssd/0/celeborn-ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.28_1185003348. 24/05/23 03:08:01,428 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: Renaming a snapshot file /mnt/ssd/0/celeborn-ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/raft_snapshot_1716404881416_5912151569428133017.dat to /mnt/ssd/0/celeborn-ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.28_1185003348. 24/05/23 03:08:02,421 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: Reinitializing state machine. 24/05/23 03:08:02,421 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: obsolete snapshot provided: (t:28, i:1185003348) leader log 24/05/23 03:08:02,117 INFO [2@group-47BEDE733167-StateMachineUpdater] StateMachine: Taking a snapshot to file /mnt/ssd/0/celeborn-ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/raft_snapshot_1716404882108_1183932657463440669.dat. 24/05/23 03:08:02,120 INFO [2@group-47BEDE733167-StateMachineUpdater] StateMachine: Saving digest 232b855f8ca673b7a2c38a8ec90c6306 for snapshot file /mnt/ssd/0/celeborn-ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.28_1185003493. 24/05/23 03:08:02,121 INFO [2@group-47BEDE733167-StateMachineUpdater] StateMachine: Renaming a snapshot file /mnt/ssd/0/celeborn-ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/raft_snapshot_1716404882108_1183932657463440669.dat to /mnt/ssd/0/celeborn-ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.28_1185003493. ``` After reinitializing, follower out of sync.  There are some variables find from follower jvm heap 1. StateMachineUpdater.appliedIndex = 1185003348 2. SegmentedRaftLog.committedIndex = 1188743857 3. SegmentedRaftLog startIndex = 1185003494, endIndex = 1188743857 4. StateMachineUpdater.snapshotIndex = 1185003348 5. ServerState.latestedInstalledSnapshot=1185003493 1. When will reinitializing be triggered? It is triggered after leader sends a snapshot to the follower and finish installing. 2. When does the leader send a snapshot to the followers in the running state? When the follower's nextIndex < the leader's startIndex, that is, the log required by the follower no longer exists in the leader. 2.1 Why does the log not exist in the leader? There is a certain synchronization delay between the leader and the folower. When the leader takes the snapshot first, it will purge its own segment log after taking it. At this time, the log with index <= snapshot index can be purged, because there is already a snapshot. At this time, the above judgment will be Take effect and then send the snapshot org.apache.ratis.server.leader.LogAppender#shouldInstallSnapshot ``` final long followerNextIndex = getFollower().getNextIndex(); if (followerNextIndex < getRaftLog().getNextIndex()) { final long logStartIndex = getRaftLog().getStartIndex(); if (followerNextIndex < logStartIndex || (logStartIndex == RaftLog.INVALID_LOG_INDEX && snapshot != null)) { return snapshot; } } ``` 3. Why is the snapshot sent by the leader to the follower not load correctly? When the follower receives the snapshot, it will write to snapshotDir, but it will not refresh the latestSnapshot, so it cannot see this snapshot when reinitialize. snapshot.getTermIndex().compareTo(getLastAppliedTermIndex()) <= 0 must be true 4. Why is the final metadata of the follower out of sync? After the leader sends the snapshot and receives the success status, it will update the follower's nextIndex to the sent snapshot last index + 1. The leader thinks that the follower has correctly installed snapsohot, but in fact it has not, which ultimately causes the follower to miss a section of the log. If the follower cannot find the log, it will lose synchronization forever. org.apache.ratis.server.impl.StateMachineUpdater#applyLog ``` final LogEntryProto next = raftLog.get(nextIndex); if (next != null) {xxx} else { LOG.debug("{}: logEntry {} is null. There may be snapshot to load. state:{}", this, nextIndex, state); break; } ``` 5. How do we solve this problem? We need to modify the SimpleStateMachineStorage currently used by celeborn. When the snapshot install is completed, we should select the latest snapshot from the local disk directory and update the latestSnapshot variable. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
