[ 
https://issues.apache.org/jira/browse/RATIS-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691660#comment-17691660
 ] 

Xinyu Tan commented on RATIS-1782:
----------------------------------

[~szetszwo] Hi, we have studied this problem again and found that the onError 
function is called at most once, so the changes in the above patch are invalid.

The following is the Javadoc for onError in the gRPC StreamObserver interface:
{code:java}
public interface StreamObserver<V>  {
  /**
   * Receives a terminating error from the stream.
   *
   * <p>May only be called once and if called it must be the last method called.
   * In particular if an exception is thrown by an implementation of
   * {@code onError} no further calls to any method are allowed.
   *
   */
  void onError(Throwable t);
}
{code}
Upon further investigation, we found two problems:
1. The real cause of the above problem is probably that the timeout for 
streaming the entire snapshot is incorrectly set to the timeout of a single RPC 
request. That is, once streaming the snapshot takes longer than 3s, the 
snapshot send fails, which triggers installSnapshot retries and further worsens 
the state of the system. We will follow up on this problem in this issue.
{code:java}
  StreamObserver<InstallSnapshotRequestProto> installSnapshot(
      StreamObserver<InstallSnapshotReplyProto> responseHandler) {
    return asyncStub.withDeadlineAfter(requestTimeoutDuration.getDuration(),
            requestTimeoutDuration.getUnit())
        .installSnapshot(responseHandler);
  }
{code}
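The arithmetic behind this can be sketched without gRPC. This is only an 
illustration, not the Ratis code or a proposed fix: the class and method names 
are hypothetical, and the chunk count and per-chunk time are made-up numbers. 
Only the 3s figure comes from the description above.

```java
import java.time.Duration;

// Hypothetical helper, not part of Ratis or gRPC.
final class SnapshotDeadlineSketch {
  private SnapshotDeadlineSketch() {}

  /**
   * One deadline covers the whole stream: every chunk must be sent
   * before the deadline expires.
   */
  static boolean fitsWholeStreamDeadline(int chunkCount, Duration perChunk, Duration deadline) {
    Duration total = perChunk.multipliedBy(chunkCount);
    return total.compareTo(deadline) <= 0;
  }

  /** The deadline is applied to each chunk separately. */
  static boolean fitsPerChunkDeadline(Duration perChunk, Duration deadline) {
    return perChunk.compareTo(deadline) <= 0;
  }

  public static void main(String[] args) {
    Duration deadline = Duration.ofSeconds(3);    // request timeout from the comment
    Duration perChunk = Duration.ofMillis(500);   // assumed time to send one chunk
    int chunkCount = 10;                          // assumed number of chunks

    System.out.println("whole-stream deadline met: "
        + fitsWholeStreamDeadline(chunkCount, perChunk, deadline));
    System.out.println("per-chunk deadline met: "
        + fitsPerChunkDeadline(perChunk, deadline));
  }
}
```

With these assumed numbers, each chunk is well within the request timeout, but 
the stream as a whole is not, which matches the failure described in item 1.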
2. The follower's digester only calls `digester.get().digest()`, which returns 
the MD5 and resets the engine, when a file has been completely received. 
However, if the client has not finished transmitting the file when the 
streaming deadline expires, the digester is not reset before the subsequent 
snapshot transfer. Therefore, a different MD5 value is calculated, causing the 
snapshot installation to fail forever. I have created jira1786 to track this 
problem.
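
The digester behaviour in the second problem can be reproduced with the plain 
JDK. The following is a minimal sketch, not the Ratis follower code: the class 
and method names are hypothetical, and the file contents and the number of 
bytes received before the timeout are made up. It only relies on 
`java.security.MessageDigest`, where `digest()` resets the engine and 
`update()` does not.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical demo class, not part of Ratis.
final class DigestResetDemo {
  private DigestResetDemo() {}

  static MessageDigest newMd5() {
    try {
      return MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      // MD5 is one of the algorithms every Java platform must provide.
      throw new IllegalStateException(e);
    }
  }

  /** MD5 of the file computed with a fresh digester. */
  static byte[] expectedMd5(byte[] file) {
    return newMd5().digest(file);
  }

  /**
   * Simulates a transfer that times out after bytesBeforeTimeout bytes,
   * followed by a full retry that reuses the same digester.
   */
  static byte[] md5AfterInterruptedTransfer(byte[] file, int bytesBeforeTimeout,
      boolean resetBeforeRetry) {
    MessageDigest digester = newMd5();
    // First attempt: part of the file arrives, digest() is never called,
    // so the partial state stays in the engine.
    digester.update(file, 0, bytesBeforeTimeout);
    if (resetBeforeRetry) {
      digester.reset(); // discard the partial state explicitly
    }
    // Retry: the whole file is streamed again.
    digester.update(file, 0, file.length);
    return digester.digest();
  }

  public static void main(String[] args) {
    byte[] file = "snapshot-file-contents".getBytes(StandardCharsets.UTF_8);
    byte[] expected = expectedMd5(file);
    System.out.println("without reset, MD5 matches: "
        + Arrays.equals(expected, md5AfterInterruptedTransfer(file, 8, false)));
    System.out.println("with reset, MD5 matches: "
        + Arrays.equals(expected, md5AfterInterruptedTransfer(file, 8, true)));
  }
}
```

Without the reset, the retry hashes the leftover bytes of the first attempt 
followed by the whole file, so the result cannot match the MD5 sent for the 
file alone.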

> gRPC installSnapshot timeout handler malfunctioning 
> ----------------------------------------------------
>
>                 Key: RATIS-1782
>                 URL: https://issues.apache.org/jira/browse/RATIS-1782
>             Project: Ratis
>          Issue Type: Bug
>          Components: gRPC, snapshot
>    Affects Versions: 2.4.1
>            Reporter: Song Ziyang
>            Assignee: Xinyu Tan
>            Priority: Blocker
>
> When the gRPC logAppender fails to install a snapshot on a follower owing to 
> a timeout, the onError callback is invoked and resetClient is called. 
> However, in this resetClient[1] handler, installSnapshotResponseHandler is 
> not set to null (unlike appendLogReponseHandler). As a result, pending 
> RPCs in the old installSnapshot pipe will time out and call onError again 
> sometime in the future, disrupting ongoing installSnapshot requests.
> [1] 
> https://github.com/apache/ratis/blob/18eacaed31e4965a9c400d86409a88fea21fc18a/ratis-grpc/src/main/java/org/apache/ratis/grpc/server/GrpcLogAppender.java#L117-L120



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
