[
https://issues.apache.org/jira/browse/RATIS-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tsz-wo Sze updated RATIS-2148:
------------------------------
Fix Version/s: 3.1.1
(was: 3.2.0)
> Snapshot transfer may cause followers to trigger reloadStateMachine
> incorrectly
> -------------------------------------------------------------------------------
>
> Key: RATIS-2148
> URL: https://issues.apache.org/jira/browse/RATIS-2148
> Project: Ratis
> Issue Type: Bug
> Components: snapshot
> Affects Versions: 3.1.0
> Reporter: yuuka
> Assignee: yuuka
> Priority: Major
> Fix For: 3.1.1
>
> Attachments: image-2024-09-03-14-24-25-652.png,
> image-2024-09-03-14-25-22-174.png, image-2024-09-03-14-27-39-406.png,
> image-2024-09-03-14-28-31-529.png, image-2024-09-03-14-30-02-751.png,
> image-2024-09-03-14-33-40-760.png, image-2024-09-03-14-33-49-573.png
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> gRPC streaming snapshot transfer sends all snapshot requests at once,
> handles errors only after everything has been sent, and uses the last
> snapshot request as the completion flag. As a result, the last request can
> be received successfully even though an earlier request has failed. The
> sender handles the failure when it retransmits the snapshot, but the
> receiver, having successfully received the last request, triggers
> state.reloadStateMachine even though the snapshot it received is incomplete.
>
> An md5 mismatch exception occurred before the last SnapshotRequest was
> received
> !image-2024-09-03-14-27-39-406.png!
>
> The last snapshot request then arrived, was received successfully, and
> updated the index.
> !image-2024-09-03-14-28-31-529.png!
> !image-2024-09-03-14-30-02-751.png!
>
> However, the snapshot was not received completely, and reloadStateMachine
> was triggered anyway.
> !image-2024-09-03-14-33-49-573.png!
>
> I suggest using a flag to record whether any request in the snapshot
> transfer has failed.
> If an exception occurs, the remaining requests of that transfer are not
> processed.
> Alternatively, the sender could wait for the receiver's reply and resend
> the request if the reply reports an error.
>
> Finally, the current retry granularity is the entire snapshot directory
> rather than a single chunk, so one failure causes a large number of snapshot
> files to be resent. This can be optimized later.
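
The flag-based suggestion in the description could look roughly like the sketch below. It is a minimal illustration, not the Ratis implementation: the class SnapshotReceiverSketch, the Chunk and Result types, and the requestId field are all hypothetical names, and the md5 check is simulated with a boolean. It also assumes that a retransmission arrives under a new requestId.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical receiver-side guard; none of these names are Ratis APIs. */
class SnapshotReceiverSketch {
  /** Simplified stand-in for one snapshot chunk request. */
  static final class Chunk {
    final String requestId; // identifies one whole snapshot transfer (assumed)
    final boolean done;     // true only on the last chunk of the transfer
    final boolean corrupt;  // simulates an md5 mismatch on this chunk

    Chunk(String requestId, boolean done, boolean corrupt) {
      this.requestId = requestId;
      this.done = done;
      this.corrupt = corrupt;
    }
  }

  enum Result { ACCEPTED, FAILED, SKIPPED, RELOADED }

  // Transfers in which at least one chunk has already failed.
  private final Set<String> failedTransfers = new HashSet<>();
  private int reloadCount = 0;

  Result onChunk(Chunk chunk) {
    // Once a transfer is flagged, ignore the rest of it, including the
    // final chunk that would otherwise act as the completion flag.
    if (failedTransfers.contains(chunk.requestId)) {
      return Result.SKIPPED;
    }
    if (chunk.corrupt) {
      failedTransfers.add(chunk.requestId);
      return Result.FAILED;
    }
    if (chunk.done) {
      // Reached only when no earlier chunk of this transfer has failed.
      reloadCount++; // stands in for the state machine reload
      return Result.RELOADED;
    }
    return Result.ACCEPTED;
  }

  int getReloadCount() {
    return reloadCount;
  }
}
```

With this guard, a transfer whose middle chunk fails never reaches the reload step even if its last chunk arrives intact; only a retransmission under a fresh requestId can complete and trigger the reload.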