Lokesh Jain created RATIS-690:
---------------------------------
Summary: RaftClient should reconnect and change leader on
TimeoutIOException, AlreadyClosedException
Key: RATIS-690
URL: https://issues.apache.org/jira/browse/RATIS-690
Project: Ratis
Issue Type: Bug
Components: client
Reporter: Lokesh Jain
Assignee: Lokesh Jain
RaftClient does not change leaderId and does not refresh the streamObserver for
the old leaderId in case of TimeoutIOException. For AlreadyClosedException also
the raft client should retry to a different leader.
For TimeoutIOException leader change is good to have if the server has false
notion of being the leader. Similarly for AlreadyClosedException I guess it
will be good to try sending request to a different server since the existing
stream might be closed by server.
{code:java}
19/09/25 18:09:42 DEBUG client.RaftClient: client-E5E9D444D128: suggested new
leader: null. Failed
RaftClientRequest:client-E5E9D444D128->46001d60-9ca0-4042-a521-69cb38b96882@group-0B8E59E48142,
cid=34, seq=11*, RW,
org.apache.hadoop.hdds.scm.XceiverClientRatis$$Lambda$72/2083154356@108d55c4
with {}19/09/25 18:09:42 DEBUG client.RaftClient: client-E5E9D444D128:
suggested new leader: null. Failed
RaftClientRequest:client-E5E9D444D128->46001d60-9ca0-4042-a521-69cb38b96882@group-0B8E59E48142,
cid=34, seq=11*, RW,
org.apache.hadoop.hdds.scm.XceiverClientRatis$$Lambda$72/2083154356@108d55c4
with {}org.apache.ratis.protocol.TimeoutIOException: Request timeout 3000ms:
RaftClientRequest:client-E5E9D444D128->46001d60-9ca0-4042-a521-69cb38b96882@group-0B8E59E48142,
cid=34, seq=11*, RW,
org.apache.hadoop.hdds.scm.XceiverClientRatis$$Lambda$72/2083154356@108d55c4 at
org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.lambda$timeoutCheck$3(GrpcClientProtocolClient.java:319)
at java.util.Optional.ifPresent(Optional.java:159) at
org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.handleReplyFuture(GrpcClientProtocolClient.java:323)
at
org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.timeoutCheck(GrpcClientProtocolClient.java:318)
at
org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.lambda$onNext$1(GrpcClientProtocolClient.java:312)
at
org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$0(TimeoutScheduler.java:113)
at
org.apache.ratis.util.TimeoutScheduler.lambda$onTimeout$1(TimeoutScheduler.java:133)
at org.apache.ratis.util.LogUtils.runAndLog(LogUtils.java:50) at
org.apache.ratis.util.LogUtils$1.run(LogUtils.java:91) at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at
java.util.concurrent.FutureTask.run(FutureTask.java:266) at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)19/09/25 18:09:42 DEBUG
client.RaftClient: client-E5E9D444D128:
oldLeader=46001d60-9ca0-4042-a521-69cb38b96882,
curLeader=46001d60-9ca0-4042-a521-69cb38b96882,
newLeader=394810ca-48fe-47bd-a0c8-eeaaec380bc319/09/25 18:09:42 DEBUG
impl.OrderedAsync: schedule* attempt #171 with policy
RetryLimited(maxAttempts=180, sleepTime=1000ms) for
RaftClientRequest:client-E5E9D444D128->46001d60-9ca0-4042-a521-69cb38b96882@group-0B8E59E48142,
cid=34, seq=11*, RW,
org.apache.hadoop.hdds.scm.XceiverClientRatis$$Lambda$72/2083154356@108d55c419/09/25
18:09:43 DEBUG impl.OrderedAsync: client-E5E9D444D128: send*
RaftClientRequest:client-E5E9D444D128->46001d60-9ca0-4042-a521-69cb38b96882@group-0B8E59E48142,
cid=34, seq=11*, RW,
org.apache.hadoop.hdds.scm.XceiverClientRatis$$Lambda$72/2083154356@108d55c419/09/25
18:09:43 DEBUG client.RaftClient: client-E5E9D444D128: suggested new leader:
null. Failed
RaftClientRequest:client-E5E9D444D128->46001d60-9ca0-4042-a521-69cb38b96882@group-0B8E59E48142,
cid=34, seq=11*, RW,
org.apache.hadoop.hdds.scm.XceiverClientRatis$$Lambda$72/2083154356@108d55c4
with {}java.io.IOException: java.lang.IllegalStateException: Entry already
exists for key 11 in map
SlidingWindow$Server:257-OrderedRequestStreamObserver257:requests at
org.apache.ratis.util.IOUtils.asIOException(IOUtils.java:54) at
org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:106) at
org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at
org.apache.ratis.grpc.GrpcUtil.unwrapIOException(GrpcUtil.java:128) at
org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onError(GrpcClientProtocolClient.java:283)
at
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:434)
at
org.apache.ratis.thirdparty.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at
org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at
org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at
org.apache.ratis.thirdparty.io.grpc.internal.CensusStatsModule$StatsClientInterceptor$1$1.onClose(CensusStatsModule.java:678)
at
org.apache.ratis.thirdparty.io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
at
org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
at
org.apache.ratis.thirdparty.io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
at
org.apache.ratis.thirdparty.io.grpc.internal.CensusTracingModule$TracingClientInterceptor$1$1.onClose(CensusTracingModule.java:397)
at
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:459)
at
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:63)
at
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:546)
at
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$600(ClientCallImpl.java:467)
at
org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:584)
at
org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at
org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)Caused by:
java.lang.IllegalStateException: Entry already exists for key 11 in map
SlidingWindow$Server:257-OrderedRequestStreamObserver257:requests at
sun.reflect.GeneratedConstructorAccessor22.newInstance(Unknown Source) at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at
org.apache.ratis.util.ReflectionUtils.instantiateException(ReflectionUtils.java:222)
at org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:104) ... 22
more
{code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)