[ 
https://issues.apache.org/jira/browse/RATIS-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16869266#comment-16869266
 ] 

Lokesh Jain commented on RATIS-592:
-----------------------------------

[~swagle] Thanks for working on this! The patch looks good to me. Please find 
my comments below. 
 # GrpcClientRpc#shouldReconnect - We always reconnect on 
AlreadyClosedException. This can trigger multiple reconnections, because 
GrpcClientProtocolClient$AsyncStreamObservers#completeReplyExceptionally 
completes the pending requests with AlreadyClosedException on completion and 
in some other cases. Each of those requests then handles the exception, and 
each one triggers its own reconnect.
 # RaftClientImpl#handleNotLeaderException - Can we rename this function, 
since it also handles LeaderNotReadyException?
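
To illustrate the first point: one way to avoid the repeated reconnects would 
be to reconnect only if the stream that the failed request was sent on is 
still the current one. The sketch below is not Ratis code. ReconnectGuard, 
Stream and reconnectIfCurrent are hypothetical names used only for 
illustration, and the actual fix in GrpcClientRpc may well look different.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch, not Ratis code: reconnect at most once per closed stream. */
class ReconnectGuard {
  /** Stand-in for a stream/proxy object; a new instance is created on each reconnect. */
  static final class Stream {}

  private final AtomicReference<Stream> current = new AtomicReference<>(new Stream());
  private final AtomicInteger reconnects = new AtomicInteger();

  Stream stream() {
    return current.get();
  }

  int reconnectCount() {
    return reconnects.get();
  }

  /**
   * Called when a request that was sent on failedOn completes with an
   * "already closed" error. compareAndSet compares references, so only the
   * first caller for a given stream swaps in a replacement; later callers
   * find that the stream has already been replaced and do nothing.
   */
  boolean reconnectIfCurrent(Stream failedOn) {
    if (current.compareAndSet(failedOn, new Stream())) {
      reconnects.incrementAndGet();
      return true;
    }
    return false;
  }
}
```

With such a guard, every pending request that fails on the same stream would 
call reconnectIfCurrent with that stream, and only the first call would 
actually reconnect.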

> One node ratis writes fail forever after first NotLeaderException or 
> LeaderNotReadyException
> --------------------------------------------------------------------------------------------
>
>                 Key: RATIS-592
>                 URL: https://issues.apache.org/jira/browse/RATIS-592
>             Project: Ratis
>          Issue Type: Bug
>          Components: gRPC
>    Affects Versions: 0.3.0
>            Reporter: Siddharth Wagle
>            Assignee: Siddharth Wagle
>            Priority: Critical
>             Fix For: 0.4.0
>
>         Attachments: RATIS-592.01.patch, RATIS-592.02.patch, 
> RATIS-592.03.patch
>
>
> RATIS-571 modified GrpcClientProtocolClient to not set the 
> AsyncStreamObserver reference to null on an exception; however, the ReplyMap 
> reference is set to null. As a result, when the client retries after a 
> NotLeader or a LeaderNotReady exception, it gets an AlreadyClosedException 
> on the stream and never recovers. This is common in unit tests where a 
> request is sent immediately after the cluster comes up.
> Nothing here is specific to one-node Ratis; however, the HDDS unit tests 
> that fail all use one-node Ratis. The most probable reason is that on a 
> three-node ring the client retries a different node each time, which 
> increases the chance of success.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
