ivandika3 commented on PR #6696:
URL: https://github.com/apache/ozone/pull/6696#issuecomment-2119125079
I found a `RaftRetryFailureException` caused by
`org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io
exception`. The current Raft client retry policy comes from
`RequestTypeDependentRetryPolicyCreator`, so `StatusRuntimeException` appears to
fall under the default `MultipleLinearRandomRetry` policy (the one used for all
other exceptions). This might block the datanode for a long time while it tries
to remove the group from another datanode and keeps retrying because of a
connection timeout.
```
2024-05-19 04:08:53,822
[f3d35daa-4aaf-4c71-973e-2392b4bdb0cc-PipelineCommandHandlerThread-0] WARN
commandhandler.ClosePipelineCommandHandler
(ClosePipelineCommandHandler.java:lambda$null$1(131)) - Failed to remove group
group-72E5E9088A19 of pipeline PipelineID=8643905b-c6aa-4c00-96b1-72e5e9088a19
on peer de8c5e91-909e-46c9-a24e-1aa548fa4b98
org.apache.ratis.protocol.exceptions.RaftRetryFailureException: Failed
GroupManagementRequest:client-363EA46DA04D->de8c5e91-909e-46c9-a24e-1aa548fa4b98@group-72E5E9088A19,
cid=120, seq=null, RW, null, Remove:group-72E5E9088A19, delete-dir for 25
attempts with
RequestTypeDependentRetryPolicy{WRITE->ExceptionDependentRetry(maxAttempts=2147483647;
defaultPolicy=MultipleLinearRandomRetry[5x5s, 5x10s, 5x15s, 5x20s, 5x25s,
10x60s];
map={org.apache.ratis.protocol.exceptions.GroupMismatchException->NoRetry,
org.apache.ratis.protocol.exceptions.NotReplicatedException->NoRetry,
org.apache.ratis.protocol.exceptions.ResourceUnavailableException->org.apache.ratis.retry.ExponentialBackoffRetry@313565ca,
org.apache.ratis.protocol.exceptions.StateMachineException->NoRetry,
org.apache.ratis.protocol.exceptions.TimeoutIOException->org.apache.ratis.retry.ExponentialBackoffRetry@313565ca}),
WATCH->ExceptionDependentRetry(maxAttempts=2147483647;
defaultPolicy=MultipleLinearRandomRetry[5x5s, 5x10s, 5x15s, 5x20s, 5x25s,
10x60s];
map={org.apache.ratis.protocol.exceptions.GroupMismatchException->NoRetry,
org.apache.ratis.protocol.exceptions.NotReplicatedException->NoRetry,
org.apache.ratis.protocol.exceptions.ResourceUnavailableException->org.apache.ratis.retry.ExponentialBackoffRetry@313565ca,
org.apache.ratis.protocol.exceptions.StateMachineException->NoRetry,
org.apache.ratis.protocol.exceptions.TimeoutIOException->NoRetry})}
at org.apache.ratis.client.impl.RaftClientImpl.noMoreRetries(RaftClientImpl.java:353)
at org.apache.ratis.client.impl.BlockingImpl.sendRequestWithRetry(BlockingImpl.java:129)
at org.apache.ratis.client.impl.GroupManagementImpl.remove(GroupManagementImpl.java:61)
at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler.lambda$null$1(ClosePipelineCommandHandler.java:123)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1652)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler.lambda$handle$2(ClosePipelineCommandHandler.java:120)
at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:99)
at org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:223)
at org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:170)
at org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:98)
at org.apache.ratis.client.impl.BlockingImpl.sendRequest(BlockingImpl.java:145)
at org.apache.ratis.client.impl.BlockingImpl.sendRequestWithRetry(BlockingImpl.java:109)
... 16 more
Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268)
at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249)
at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:167)
at org.apache.ratis.proto.grpc.AdminProtocolServiceGrpc$AdminProtocolServiceBlockingStub.groupManagement(AdminProtocolServiceGrpc.java:468)
at org.apache.ratis.grpc.client.GrpcClientProtocolClient.lambda$groupAdd$5(GrpcClientProtocolClient.java:172)
at org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:221)
... 20 more
Caused by: org.apache.ratis.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.1.0.100:15049
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
at org.apache.ratis.thirdparty.io.netty.channel.unix.Errors.newConnectException0(Errors.java:166)
at org.apache.ratis.thirdparty.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:131)
at org.apache.ratis.thirdparty.io.netty.channel.unix.Socket.finishConnect(Socket.java:359)
at org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
at org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
at org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
at org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:499)
at org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407)
at org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at org.apache.ratis.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:750)
```
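
To get a rough sense of how long the command handler thread may have been stuck, the sketch below adds up the nominal sleeps in the `MultipleLinearRandomRetry[5x5s, 5x10s, 5x15s, 5x20s, 5x25s, 10x60s]` string from the log for the 25 attempts reported. `RetryBudget` and its methods are hypothetical helpers written for this comment, not Ratis APIs. The sketch assumes each `NxTs` entry means N retries with a nominal sleep of T seconds, counts one sleep per attempt, and ignores both any randomization of the sleep time and the time spent in each RPC, so the result is only an estimate.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Ratis): estimates the nominal wait of a multi-linear retry spec. */
final class RetryBudget {
  private RetryBudget() {}

  /** Expands a spec such as "5x5s, 5x10s" into one nominal sleep (in seconds) per retry. */
  static List<Long> expand(String spec) {
    List<Long> sleeps = new ArrayList<>();
    for (String part : spec.split(",")) {
      String p = part.trim();
      if (p.isEmpty()) {
        continue;
      }
      int x = p.indexOf('x');
      if (x < 0 || !p.endsWith("s")) {
        throw new IllegalArgumentException("Bad entry: " + p);
      }
      int count = Integer.parseInt(p.substring(0, x));
      long seconds = Long.parseLong(p.substring(x + 1, p.length() - 1));
      for (int i = 0; i < count; i++) {
        sleeps.add(seconds);
      }
    }
    return sleeps;
  }

  /** Sums the nominal sleeps of the first {@code attempts} retries; jitter and RPC time are ignored. */
  static long nominalWaitSeconds(String spec, int attempts) {
    List<Long> sleeps = expand(spec);
    long total = 0;
    for (int i = 0; i < Math.min(attempts, sleeps.size()); i++) {
      total += sleeps.get(i);
    }
    return total;
  }

  public static void main(String[] args) {
    // Policy string copied from the log above.
    String spec = "5x5s, 5x10s, 5x15s, 5x20s, 5x25s, 10x60s";
    System.out.println(nominalWaitSeconds(spec, 25)); // 375
    System.out.println(nominalWaitSeconds(spec, 35)); // 975
  }
}
```

Under these assumptions, the 25 attempts in the log correspond to roughly 375 seconds of nominal sleep (a bit over six minutes) for a single unreachable peer, which is the kind of delay that could hold up the `PipelineCommandHandlerThread`.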