[
https://issues.apache.org/jira/browse/RATIS-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878968#comment-17878968
]
yuuka commented on RATIS-2148:
------------------------------
Yes, I am considering how to better fix this problem.
> GRPC streaming snapshot transfer may cause followers to trigger
> reloadStateMachine incorrectly
> ----------------------------------------------------------------------------------------------
>
> Key: RATIS-2148
> URL: https://issues.apache.org/jira/browse/RATIS-2148
> Project: Ratis
> Issue Type: Bug
> Components: snapshot
> Affects Versions: 3.1.0, 3.2.0
> Reporter: yuuka
> Priority: Major
> Attachments: image-2024-09-03-14-24-25-652.png,
> image-2024-09-03-14-25-22-174.png, image-2024-09-03-14-27-39-406.png,
> image-2024-09-03-14-28-31-529.png, image-2024-09-03-14-30-02-751.png,
> image-2024-09-03-14-33-40-760.png, image-2024-09-03-14-33-49-573.png
>
>
> gRPC streaming snapshot transfer sends all requests at once and performs
> error handling only after everything has been sent, and the last snapshot
> request is used as the completion flag. As a result, the last request can be
> received successfully even though an earlier request has failed. The sender
> handles the failure event when it retransmits the snapshot. The receiver,
> however, triggers state.reloadStateMachine because it successfully received
> the last request, even though the snapshot it received is incomplete.
>
> An MD5 mismatch exception occurred before the last SnapshotRequest was
> received.
> !image-2024-09-03-14-27-39-406.png!
>
> The last snapshot request then arrived, was received successfully, and
> updated the index.
> !image-2024-09-03-14-28-31-529.png!
> !image-2024-09-03-14-30-02-751.png!
>
> However, the snapshot reception is incomplete, yet reloadStateMachine is
> still triggered.
> !image-2024-09-03-14-33-49-573.png!
>
> I suggest using a flag to record whether the snapshot request as a whole
> has failed.
> Once an exception occurs, the remaining content of the request is not
> processed.
> Alternatively, the sender could wait for the receiver's reply and resend
> if the reply reports an error.
>
> Finally, error retries currently operate on the entire snapshot directory
> rather than on a single chunk, so a large number of snapshot files are sent
> repeatedly. This can be optimized later.
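
The flag suggested in the description above could look roughly like the following. This is only a minimal sketch and not Ratis code: the class SnapshotStreamGuard, its Result enum, and the method names are all hypothetical, and chunk verification (for example the MD5 check) is assumed to happen elsewhere and be passed in as a boolean.

```java
// Hypothetical sketch of a per-stream failure flag; none of these names exist in Ratis.
final class SnapshotStreamGuard {
    /** Outcome of handling one snapshot chunk. */
    enum Result { ACCEPTED, REJECTED, SKIPPED, COMPLETED }

    // Set once any chunk of the current stream fails verification.
    private boolean failed = false;

    /**
     * Handles one chunk of the stream.
     *
     * @param chunkValid whether the chunk passed verification (assumed to be checked by the caller)
     * @param isLast     whether the sender marked this chunk as the final one
     */
    Result onChunk(boolean chunkValid, boolean isLast) {
        if (failed) {
            // An earlier chunk failed: ignore everything else, including the last chunk.
            return Result.SKIPPED;
        }
        if (!chunkValid) {
            failed = true;
            return Result.REJECTED;
        }
        return isLast ? Result.COMPLETED : Result.ACCEPTED;
    }

    /** Only a stream that completed without any failure should lead to a reload. */
    boolean mayReload(Result lastResult) {
        return lastResult == Result.COMPLETED;
    }

    /** Clears the flag when the sender restarts the snapshot transfer. */
    void reset() {
        failed = false;
    }
}
```

With such a guard, a final chunk that arrives after a failed chunk yields SKIPPED rather than COMPLETED, so the receiver would not reload the state machine until a retransmitted stream completes cleanly.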
--
This message was sent by Atlassian Jira
(v8.20.10#820010)