Hi folks,
I've found a bug in the installSnapshot RPC handler that causes the follower to
reply success when the installation actually failed.
org.apache.ratis.server.storage.SnapshotManager.java
public void installSnapshot(StateMachine stateMachine,
    InstallSnapshotRequestProto request) throws IOException {
  ...
  if (snapshotChunkRequest.getDone()) {
    LOG.info("Install snapshot is done, renaming tnp dir:{} to:{}",
        tmpDir, dir.getStateMachineDir());
    dir.getStateMachineDir().delete(); // Here delete() may fail
    tmpDir.renameTo(dir.getStateMachineDir());
  }
}
After the follower receives the entire snapshot, it first stores the files in a
tmp dir and then renames the tmp dir to StateMachineDir. However, when
StateMachineDir is not empty, delete() fails, and renameTo() then fails too.
Neither return value is checked, so the follower still replies success. In this
scenario the latest snapshot remains in the tmp dir and the state machine
cannot load it.
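The failure can be reproduced outside Ratis with a minimal, standalone sketch.
The class name, directory names and file names below are hypothetical and only
illustrate the java.io.File behaviour: delete() on a non-empty directory
returns false instead of throwing, and renameTo() onto an existing non-empty
directory is expected to return false as well (renameTo() is documented as
platform-dependent, so treat the second result as an assumption for your OS).

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

class DeleteRenameDemo {
  // Repeats the delete()/renameTo() pair from the snippet above on
  // throwaway directories and returns {deleted, renamed, stuckInTmp}.
  static boolean[] runDemo() {
    try {
      File root = Files.createTempDirectory("snapshot-demo").toFile();
      File stateMachineDir = new File(root, "sm");   // hypothetical name
      File tmpDir = new File(root, "tmp");           // hypothetical name
      if (!stateMachineDir.mkdir() || !tmpDir.mkdir()) {
        throw new IOException("could not create demo directories");
      }
      // An old snapshot makes the target directory non-empty.
      Files.createFile(new File(stateMachineDir, "snapshot.old").toPath());
      Files.createFile(new File(tmpDir, "snapshot.new").toPath());

      // Both calls signal failure only through their return value.
      boolean deleted = stateMachineDir.delete();
      boolean renamed = tmpDir.renameTo(stateMachineDir);
      boolean stuckInTmp = new File(tmpDir, "snapshot.new").exists();
      return new boolean[] {deleted, renamed, stuckInTmp};
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    boolean[] r = runDemo();
    System.out.println("delete=" + r[0] + " renameTo=" + r[1]
        + " stillInTmp=" + r[2]);
  }
}
```

Because no exception is thrown in either call, code that ignores the two
boolean results has no way to notice that the new snapshot never moved.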
StateMachineDir can be non-empty because previously installed snapshots are
stored there and may not have been cleaned up yet under the retention policy.
The next time the leader installs a snapshot, this situation occurs.
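One possible direction for a fix, shown only as a sketch and not as the actual
Ratis change: remove the old directory recursively and use java.nio.file.Files,
which throws IOException on failure instead of returning false, so the handler
cannot silently report success. The class and method names here are
hypothetical, and the sketch assumes it is acceptable to drop the old snapshots
wholesale, which may not match the retention policy.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class SnapshotDirReplacer {
  // Deletes dir and everything below it; throws if any entry cannot be removed.
  static void deleteRecursively(Path dir) throws IOException {
    if (!Files.exists(dir)) {
      return;
    }
    List<Path> entries;
    try (Stream<Path> walk = Files.walk(dir)) {
      // Reverse order puts children before their parent directories.
      entries = walk.sorted(Comparator.reverseOrder())
          .collect(Collectors.toList());
    }
    for (Path p : entries) {
      Files.delete(p);
    }
  }

  // Replaces targetDir with tmpDir; any failure surfaces as an IOException.
  static void replace(Path tmpDir, Path targetDir) throws IOException {
    deleteRecursively(targetDir);
    Files.move(tmpDir, targetDir);
  }

  // Small self-check on throwaway directories (hypothetical file names).
  static boolean demo() {
    try {
      Path root = Files.createTempDirectory("snapshot-fix");
      Path target = Files.createDirectory(root.resolve("sm"));
      Path tmp = Files.createDirectory(root.resolve("tmp"));
      Files.createFile(target.resolve("snapshot.old"));
      Files.createFile(tmp.resolve("snapshot.new"));
      replace(tmp, target);
      return Files.exists(target.resolve("snapshot.new"))
          && !Files.exists(target.resolve("snapshot.old"))
          && !Files.exists(tmp);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Since installSnapshot() already declares throws IOException, an exception from
replace() would propagate to the caller, which could then reply with a failure
instead of success.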
Thanks!
William Song
Apache IoTDB