[jira] [Created] (RATIS-2148) Error triggering ServerState.reloadStatemachine

yuuka (Jira) Mon, 02 Sep 2024 23:42:04 -0700

yuuka created RATIS-2148:
----------------------------

             Summary: Error triggering ServerState.reloadStatemachine 
                 Key: RATIS-2148
                 URL: https://issues.apache.org/jira/browse/RATIS-2148
             Project: Ratis
          Issue Type: Bug
          Components: snapshot
    Affects Versions: 3.1.0, 3.2.0
            Reporter: yuuka
         Attachments: image-2024-09-03-14-24-25-652.png, 
image-2024-09-03-14-25-22-174.png, image-2024-09-03-14-27-39-406.png, 
image-2024-09-03-14-28-31-529.png, image-2024-09-03-14-30-02-751.png, 
image-2024-09-03-14-33-40-760.png, image-2024-09-03-14-33-49-573.png


The gRPC streaming snapshot sender sends all snapshot requests at once and 
handles errors only after all of them have been sent, and the last snapshot 
request serves as the completion flag. As a result, the receiver can 
successfully receive the last request even though an earlier request has 
failed. The sender handles the failure by retransmitting the snapshot, but the 
receiver, having successfully received the last request, triggers 
state.reloadStateMachine even though the snapshot was not received completely.
 
An MD5 mismatch exception occurred before the last SnapshotRequest was received:
!image-2024-09-03-14-27-39-406.png!
 
The last snapshot request then arrived and was received successfully, and the 
index was updated:
!image-2024-09-03-14-28-31-529.png!
!image-2024-09-03-14-30-02-751.png!
 
However, the snapshot was not received completely, and reloadStateMachine is 
triggered anyway:
!image-2024-09-03-14-33-49-573.png!
 
I suggest using a flag to record whether any request of the snapshot 
installation has failed. Once an exception occurs, the remaining requests of 
that installation are not processed.
Alternatively, the sender could wait for the receiver's reply and resend the 
request if an error is reported.
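
The flag-based suggestion could look roughly like the sketch below. It is only 
an illustration and is not the Ratis API: the class SnapshotInstallTracker, the 
requestId parameter and the Result enum are hypothetical names, and the 
chunkValid argument stands in for whatever validation the receiver performs 
(for example the MD5 check).

```java
/** Hypothetical receiver-side tracker for one snapshot installation attempt. */
final class SnapshotInstallTracker {
  /** Outcome of handling a single snapshot request (chunk). */
  enum Result { ACCEPTED, REJECTED, COMPLETE }

  // Assumed: every request of one installation attempt carries the same id.
  private String currentRequestId;
  // Set once any request of the current attempt has failed.
  private boolean failed;

  /**
   * @param requestId  id of the installation attempt (hypothetical field)
   * @param chunkValid whether this request passed validation, e.g. MD5 matched
   * @param done       whether the sender marked this request as the last one
   */
  synchronized Result onChunk(String requestId, boolean chunkValid, boolean done) {
    if (!requestId.equals(currentRequestId)) {
      // A new attempt (for example a retransmission) starts with a clean flag.
      currentRequestId = requestId;
      failed = false;
    }
    if (failed) {
      // An earlier request of this attempt failed: ignore the rest,
      // including the last request that would otherwise signal completion.
      return Result.REJECTED;
    }
    if (!chunkValid) {
      failed = true;
      return Result.REJECTED;
    }
    return done ? Result.COMPLETE : Result.ACCEPTED;
  }
}
```

With such a tracker, the receiver would call reloadStateMachine only when 
onChunk returns COMPLETE, so a valid last request that follows a failed one no 
longer counts as a finished snapshot.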
 
Finally, error retry currently operates on the entire snapshot directory rather 
than on a single chunk, so a failure causes a large number of snapshot files to 
be sent again. This can be optimized later.
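
As a rough illustration of what chunk-level retry might involve, the sketch 
below tracks which chunks the receiver has acknowledged so that only the 
remaining ones are resent. It assumes chunks can be identified by a zero-based 
index and acknowledged individually, which the current protocol may not 
support; ChunkRetryPlan and its methods are hypothetical names, not part of 
Ratis.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical sender-side record of which snapshot chunks were acknowledged. */
final class ChunkRetryPlan {
  private final int totalChunks;
  private final BitSet acked;

  ChunkRetryPlan(int totalChunks) {
    if (totalChunks < 0) {
      throw new IllegalArgumentException("totalChunks must be >= 0");
    }
    this.totalChunks = totalChunks;
    this.acked = new BitSet(totalChunks);
  }

  /** Records that the receiver confirmed the chunk with the given index. */
  void markAcked(int index) {
    if (index < 0 || index >= totalChunks) {
      throw new IndexOutOfBoundsException("chunk index " + index);
    }
    acked.set(index);
  }

  /** Returns the indices of all chunks that still need to be sent. */
  List<Integer> chunksToResend() {
    List<Integer> pending = new ArrayList<>();
    for (int i = acked.nextClearBit(0); i < totalChunks; i = acked.nextClearBit(i + 1)) {
      pending.add(i);
    }
    return pending;
  }

  /** True once every chunk has been acknowledged. */
  boolean isComplete() {
    return acked.cardinality() == totalChunks;
  }
}
```

On a failure the sender would then resend only chunksToResend() instead of the 
whole snapshot directory.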



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
