[jira] [Comment Edited] (RATIS-592) One node ratis writes fail forever after first NotLeaderException or LeaderNotReadyException

Lokesh Jain (JIRA) Wed, 19 Jun 2019 05:45:26 -0700


    [ 
https://issues.apache.org/jira/browse/RATIS-592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867572#comment-16867572
 ]


Lokesh Jain edited comment on RATIS-592 at 6/19/19 12:44 PM:
-------------------------------------------------------------

[~swagle] Thanks for looking into this! LeaderNotReadyException is currently 
received in GrpcClientProtocolClient$AsyncStreamObservers#onError. The 
exception should instead be part of RaftClientReply, because we should not 
close the stream observer on receiving a LeaderNotReadyException.
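
To illustrate the difference between the two delivery paths, here is a minimal 
sketch. It is a simplified model, not the Ratis or gRPC code: ReplySketch, 
Reply, LeaderNotReady and ClientStream are hypothetical stand-ins. It only 
shows that an exception delivered through onError ends the stream, while the 
same exception carried inside a reply leaves the stream open for a retry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model; none of these types are the real Ratis or gRPC classes.
final class ReplySketch {
  /** A reply that may carry a server-side exception instead of a result. */
  static final class Reply {
    final long callId;
    final Exception exception; // null on success

    Reply(long callId, Exception exception) {
      this.callId = callId;
      this.exception = exception;
    }
  }

  /** Stand-in for LeaderNotReadyException. */
  static final class LeaderNotReady extends Exception {
    LeaderNotReady(String message) {
      super(message);
    }
  }

  /** Client-side stream state: onError is terminal, onNext is not. */
  static final class ClientStream {
    boolean closed = false;
    final List<Long> toRetry = new ArrayList<>();

    // Error path: the stream can no longer be used afterwards.
    void onError(Throwable t) {
      closed = true;
    }

    // Reply path: an exception inside the reply only marks the call for retry.
    void onNext(Reply reply) {
      if (reply.exception instanceof LeaderNotReady) {
        toRetry.add(reply.callId);
      }
    }
  }

  public static void main(String[] args) {
    ClientStream viaError = new ClientStream();
    viaError.onError(new LeaderNotReady("leader not ready"));
    System.out.println("via onError: closed = " + viaError.closed);

    ClientStream viaReply = new ClientStream();
    viaReply.onNext(new Reply(1L, new LeaderNotReady("leader not ready")));
    System.out.println("via reply: closed = " + viaReply.closed
        + ", calls to retry = " + viaReply.toRetry);
  }
}
```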

Furthermore, the changes in RATIS-571 related to GrpcClientProtocolClient do 
not work for a single-node Ratis pipeline. For a multi-node Ratis pipeline, the 
leader is changed on receiving NotLeaderException, and hence the old 
streamObserver is closed (RaftClientImpl#handleIOException). For a single-node 
pipeline the leader cannot be changed, so the stream is never closed. That is a bug.
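
The single-node case described above can be modelled roughly as follows. This 
is a hedged sketch, not RaftClientImpl: LeaderChangeSketch, Client, Stream and 
the alwaysReset flag are hypothetical names introduced for illustration. The 
alwaysReset variant is only one possible direction and is not taken from the 
attached patch.

```java
import java.util.List;

// Hypothetical, simplified model of the retry path; not the real RaftClientImpl.
final class LeaderChangeSketch {
  /** Stand-in for a per-server stream; once unusable, every send fails. */
  static final class Stream {
    boolean unusable = false;

    boolean send() {
      return !unusable;
    }
  }

  static final class Client {
    final List<String> peers;
    String leader;
    Stream stream = new Stream();

    Client(List<String> peers) {
      this.peers = peers;
      this.leader = peers.get(0);
    }

    // The stale stream is replaced only when a different peer becomes the
    // assumed leader, unless alwaysReset (an assumption of this sketch) is set.
    void handleNotLeader(boolean alwaysReset) {
      stream.unusable = true; // the error has already broken the current stream
      String next = peers.get((peers.indexOf(leader) + 1) % peers.size());
      if (!next.equals(leader) || alwaysReset) {
        leader = next;
        stream = new Stream(); // fresh stream for the retry
      }
    }
  }

  public static void main(String[] args) {
    Client threeNodes = new Client(List.of("s1", "s2", "s3"));
    threeNodes.handleNotLeader(false);
    System.out.println("three nodes, retry succeeds: " + threeNodes.stream.send());

    Client oneNode = new Client(List.of("s1"));
    oneNode.handleNotLeader(false);
    System.out.println("one node, retry succeeds: " + oneNode.stream.send());

    Client oneNodeReset = new Client(List.of("s1"));
    oneNodeReset.handleNotLeader(true);
    System.out.println("one node with reset, retry succeeds: " + oneNodeReset.stream.send());
  }
}
```

In this model the one-node client keeps retrying on the same broken stream, 
because the only trigger for replacing it is a leader change that can never 
happen.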


was (Author: ljain):
[~swagle] Thanks for looking into this! LeaderNotReadyException currently is 
received in GrpcClientProtocolClient$AsyncStreamObservers#onError. The 
exception should rather be part of RaftClientReply. The reason is we should not 
be closing the stream observer on receiving a LeaderNotReadyException.

Further, RATIS-571 does not work for a single-node Ratis pipeline. For a 
multi-node Ratis pipeline, the leader is changed on receiving 
NotLeaderException, and hence the old streamObserver is closed 
(RaftClientImpl#handleIOException). For a single-node pipeline the leader 
cannot be changed, so the stream is never closed. We need to address that 
issue as well.

> One node ratis writes fail forever after first NotLeaderException or 
> LeaderNotReadyException
> --------------------------------------------------------------------------------------------
>
>                 Key: RATIS-592
>                 URL: https://issues.apache.org/jira/browse/RATIS-592
>             Project: Ratis
>          Issue Type: Bug
>          Components: gRPC
>    Affects Versions: 0.3.0
>            Reporter: Siddharth Wagle
>            Assignee: Siddharth Wagle
>            Priority: Critical
>             Fix For: 0.4.0
>
>         Attachments: RATIS-592.01.patch
>
>
> RATIS-571 modified the GrpcClientProtocolClient to not set the 
> AsyncStreamObserver reference to null on an exception; however, the ReplyMap 
> reference is still set to null. As a result, on a retry for a 
> NotLeaderException or a LeaderNotReadyException, the client gets an 
> AlreadyClosedException on the stream and never recovers. This is common in a 
> unit test scenario where a request is sent immediately after the cluster is up.
> There is nothing special here about one-node Ratis; however, the HDDS unit 
> tests that fail are all one-node Ratis. The most probable cause is that, on a 
> three-node ring, the client retries a different node each time, which 
> increases the chance of success.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
