ivandika3 commented on PR #6696:
URL: https://github.com/apache/ozone/pull/6696#issuecomment-2119125079

   Found a `RaftRetryFailureException` caused by 
`org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception`. Since the current Raft client retry policy comes from 
`RequestTypeDependentRetryPolicyCreator`, it seems `StatusRuntimeException` 
falls through to the default policy, `MultipleLinearRandomRetry` (used for 
exceptions not listed in the map). This might block the datanode for a long 
time while it tries to remove the group on another datanode and the connection 
fails (`Connection refused` in the log below).
   
   ```
   2024-05-19 04:08:53,822 
[f3d35daa-4aaf-4c71-973e-2392b4bdb0cc-PipelineCommandHandlerThread-0] WARN  
commandhandler.ClosePipelineCommandHandler 
(ClosePipelineCommandHandler.java:lambda$null$1(131)) - Failed to remove group 
group-72E5E9088A19 of pipeline PipelineID=8643905b-c6aa-4c00-96b1-72e5e9088a19 
on peer de8c5e91-909e-46c9-a24e-1aa548fa4b98
   org.apache.ratis.protocol.exceptions.RaftRetryFailureException: Failed 
GroupManagementRequest:client-363EA46DA04D->de8c5e91-909e-46c9-a24e-1aa548fa4b98@group-72E5E9088A19,
 cid=120, seq=null, RW, null, Remove:group-72E5E9088A19, delete-dir for 25 
attempts with 
RequestTypeDependentRetryPolicy{WRITE->ExceptionDependentRetry(maxAttempts=2147483647;
 defaultPolicy=MultipleLinearRandomRetry[5x5s, 5x10s, 5x15s, 5x20s, 5x25s, 
10x60s]; 
map={org.apache.ratis.protocol.exceptions.GroupMismatchException->NoRetry, 
org.apache.ratis.protocol.exceptions.NotReplicatedException->NoRetry, 
org.apache.ratis.protocol.exceptions.ResourceUnavailableException->org.apache.ratis.retry.ExponentialBackoffRetry@313565ca,
 org.apache.ratis.protocol.exceptions.StateMachineException->NoRetry, 
org.apache.ratis.protocol.exceptions.TimeoutIOException->org.apache.ratis.retry.ExponentialBackoffRetry@313565ca}),
 WATCH->ExceptionDependentRetry(maxAttempts=2147483647; 
defaultPolicy=MultipleLinearRandomRetry[5x5s, 5x10s, 5x15s, 5x20s, 5x25s, 
10x60s]; 
map={org.apache.ratis.protocol.exceptions.GroupMismatchException->NoRetry, 
org.apache.ratis.protocol.exceptions.NotReplicatedException->NoRetry, 
org.apache.ratis.protocol.exceptions.ResourceUnavailableException->org.apache.ratis.retry.ExponentialBackoffRetry@313565ca,
 org.apache.ratis.protocol.exceptions.StateMachineException->NoRetry, 
org.apache.ratis.protocol.exceptions.TimeoutIOException->NoRetry})}
        at 
org.apache.ratis.client.impl.RaftClientImpl.noMoreRetries(RaftClientImpl.java:353)
        at 
org.apache.ratis.client.impl.BlockingImpl.sendRequestWithRetry(BlockingImpl.java:129)
        at 
org.apache.ratis.client.impl.GroupManagementImpl.remove(GroupManagementImpl.java:61)
        at 
org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler.lambda$null$1(ClosePipelineCommandHandler.java:123)
        at 
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
        at 
java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
        at 
java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1652)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at 
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
        at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
        at 
org.apache.hadoop.ozone.container.common.statemachine.commandhandler.ClosePipelineCommandHandler.lambda$handle$2(ClosePipelineCommandHandler.java:120)
        at 
java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   Caused by: java.io.IOException: 
org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: UNAVAILABLE: io 
exception
        at org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:99)
        at 
org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:223)
        at 
org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:170)
        at 
org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:98)
        at 
org.apache.ratis.client.impl.BlockingImpl.sendRequest(BlockingImpl.java:145)
        at 
org.apache.ratis.client.impl.BlockingImpl.sendRequestWithRetry(BlockingImpl.java:109)
        ... 16 more
   Caused by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
UNAVAILABLE: io exception
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268)
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249)
        at 
org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:167)
        at 
org.apache.ratis.proto.grpc.AdminProtocolServiceGrpc$AdminProtocolServiceBlockingStub.groupManagement(AdminProtocolServiceGrpc.java:468)
        at 
org.apache.ratis.grpc.client.GrpcClientProtocolClient.lambda$groupAdd$5(GrpcClientProtocolClient.java:172)
        at 
org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:221)
        ... 20 more
   Caused by: 
org.apache.ratis.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
 finishConnect(..) failed: Connection refused: /10.1.0.100:15049
   Caused by: java.net.ConnectException: finishConnect(..) failed: Connection 
refused
        at 
org.apache.ratis.thirdparty.io.netty.channel.unix.Errors.newConnectException0(Errors.java:166)
        at 
org.apache.ratis.thirdparty.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:131)
        at 
org.apache.ratis.thirdparty.io.netty.channel.unix.Socket.finishConnect(Socket.java:359)
        at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
        at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687)
        at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
        at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:499)
        at 
org.apache.ratis.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407)
        at 
org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
        at 
org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at 
org.apache.ratis.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:750)
   ```
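
   To put a rough number on the blocking described above, here is a minimal sketch that sums the nominal sleep times from the policy string printed in the log (`5x5s, 5x10s, ...`). The class and method names (`RetryBudget`, `nominalSleepSeconds`) are hypothetical and not part of Ratis or Ozone. The sketch assumes each `NxTs` pair means N retries with a nominal sleep of T seconds; since the policy name suggests the actual sleeps are randomized, and connect time per attempt is ignored, treat the result as an estimate only.

   ```java
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;

   public class RetryBudget {
       // Matches one "<count>x<seconds>s" pair, e.g. "5x10s" (assumed format,
       // taken from the policy string in the log above).
       private static final Pattern PAIR = Pattern.compile("(\\d+)x(\\d+)s");

       /** Sums the nominal sleep seconds over at most maxAttempts retries. */
       static long nominalSleepSeconds(String schedule, int maxAttempts) {
           Matcher m = PAIR.matcher(schedule);
           long total = 0;
           int remaining = maxAttempts;
           while (remaining > 0 && m.find()) {
               int count = Integer.parseInt(m.group(1));
               long seconds = Long.parseLong(m.group(2));
               // Use only as many retries from this pair as are still allowed.
               int used = Math.min(count, remaining);
               total += used * seconds;
               remaining -= used;
           }
           return total;
       }

       public static void main(String[] args) {
           String schedule = "5x5s, 5x10s, 5x15s, 5x20s, 5x25s, 10x60s";
           // The 25 attempts reported in the log.
           System.out.println(nominalSleepSeconds(schedule, 25)); // prints 375
           // The whole schedule, if it were run to the end.
           System.out.println(nominalSleepSeconds(schedule, Integer.MAX_VALUE)); // prints 975
       }
   }
   ```

   Under these assumptions, the 25 attempts in the log correspond to roughly 375 seconds of nominal sleep on the `PipelineCommandHandlerThread` for a single unreachable peer.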

