yuuka created RATIS-2148:
----------------------------
Summary: Error triggering ServerState.reloadStatemachine
Key: RATIS-2148
URL: https://issues.apache.org/jira/browse/RATIS-2148
Project: Ratis
Issue Type: Bug
Components: snapshot
Affects Versions: 3.1.0, 3.2.0
Reporter: yuuka
Attachments: image-2024-09-03-14-24-25-652.png,
image-2024-09-03-14-25-22-174.png, image-2024-09-03-14-27-39-406.png,
image-2024-09-03-14-28-31-529.png, image-2024-09-03-14-30-02-751.png,
image-2024-09-03-14-33-40-760.png, image-2024-09-03-14-33-49-573.png
Because gRPC streaming snapshot sending sends all requests at once, error
handling is performed only after everything has been sent, and the last
snapshot request is used as the completion flag. As a result, the last request
can be received successfully even though an earlier request has failed. The
sender handles the failure event when it retransmits the snapshot. The
receiver, however, triggers state.reloadStateMachine because it received the
last request successfully, even though the snapshot it received is incomplete.
An MD5 mismatch exception occurred before the last SnapshotRequest was received:
!image-2024-09-03-14-27-39-406.png!
The last snapshot request then arrived, was received successfully, and updated
the index:
!image-2024-09-03-14-28-31-529.png!
!image-2024-09-03-14-30-02-751.png!
However, the snapshot was not received completely, and reloadStateMachine is
triggered anyway:
!image-2024-09-03-14-33-49-573.png!
I suggest using a flag to record whether any request of the snapshot transfer
has failed. Once an exception occurs, the remaining requests of that transfer
are not processed.
Alternatively, the sender could wait for the receiver's reply and resend if the
reply reports an error.
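
A minimal sketch of the flag idea, to make the suggestion concrete. This is not
the Ratis API: SnapshotInstallSession, Result, and onChunk are hypothetical
names, and the per-chunk MD5 check is a simplified stand-in for the real
verification. Once one chunk fails, every later chunk of the same transfer is
rejected, so the last request can no longer report completion on its own.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical receiver-side guard for one snapshot transfer; not the Ratis API. */
class SnapshotInstallSession {
    /** Outcome of handling a single chunk. */
    enum Result { ACCEPTED, REJECTED, COMPLETED }

    private boolean failed = false;   // the suggested flag
    private int acceptedChunks = 0;

    /**
     * Handles one chunk. After any chunk fails verification, every later chunk
     * of this transfer is rejected, including the one marked as last.
     */
    Result onChunk(byte[] data, byte[] expectedMd5, boolean last) {
        if (failed) {
            return Result.REJECTED;
        }
        if (!Arrays.equals(md5(data), expectedMd5)) {
            failed = true;
            return Result.REJECTED;
        }
        acceptedChunks++;
        // Only a transfer without failures may report completion, which is the
        // point where a caller would reload the state machine.
        return last ? Result.COMPLETED : Result.ACCEPTED;
    }

    boolean hasFailed() {
        return failed;
    }

    int getAcceptedChunks() {
        return acceptedChunks;
    }

    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is unavailable", e);
        }
    }
}
```

With this guard, a caller would reload the state machine only when onChunk
returns COMPLETED, and would start a fresh session when the sender retransmits.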
Finally, errors are currently retried at the level of the entire snapshot
directory rather than a single chunk, so a large number of snapshot files are
sent again. This can be optimized later.
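
One possible shape for chunk-level retry, as a rough sketch only. It assumes
the receiver acknowledges each chunk individually, which this issue does not
confirm, and ChunkRetryTracker with its methods is a hypothetical name rather
than anything in Ratis. The sender records acknowledged chunks and, on retry,
resends only the ones still missing.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical sender-side bookkeeping for chunk-level retry; not the Ratis API. */
class ChunkRetryTracker {
    private final int totalChunks;
    private final BitSet acknowledged;

    ChunkRetryTracker(int totalChunks) {
        if (totalChunks <= 0) {
            throw new IllegalArgumentException("totalChunks must be positive");
        }
        this.totalChunks = totalChunks;
        this.acknowledged = new BitSet(totalChunks);
    }

    /** Records that the receiver verified and stored the given chunk. */
    void onAck(int chunkIndex) {
        if (chunkIndex < 0 || chunkIndex >= totalChunks) {
            throw new IndexOutOfBoundsException("chunkIndex: " + chunkIndex);
        }
        acknowledged.set(chunkIndex);
    }

    /** Returns the indices of chunks that still need to be sent, in order. */
    List<Integer> chunksToResend() {
        List<Integer> pending = new ArrayList<>();
        for (int i = acknowledged.nextClearBit(0); i < totalChunks;
             i = acknowledged.nextClearBit(i + 1)) {
            pending.add(i);
        }
        return pending;
    }

    /** True once every chunk has been acknowledged. */
    boolean isComplete() {
        return acknowledged.cardinality() == totalChunks;
    }
}
```

In this sketch the completion signal is isComplete() rather than the arrival
of the last request, so a failed middle chunk keeps the transfer open.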