Janus Chow created HDDS-11291:
---------------------------------
Summary: Datanode Command Handler blocked by executing ratis
requests
Key: HDDS-11291
URL: https://issues.apache.org/jira/browse/HDDS-11291
Project: Apache Ozone
Issue Type: Bug
Reporter: Janus Chow
Assignee: Janus Chow
We hit the following issue: while the Datanode command handler was executing a
close container request, the timeout logic did not take effect, so the handler
thread stayed blocked and all subsequent commands from SCM were blocked behind it.
The jstack output is as follows:
{code:java}
"Command processor thread" #215 daemon prio=5 os_prio=0 tid=0x00007fcef3262000 nid=0xa56 waiting on condition [0x00007fcf63f9d000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00007fd4ab6dcd38> (a java.util.concurrent.CompletableFuture$Signaller)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
    at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
    at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
    at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
    at org.apache.ratis.server.impl.RaftServerImpl.executeSubmitClientRequestAsync(RaftServerImpl.java:816)
    at org.apache.ratis.server.impl.RaftServerProxy.lambda$submitClientRequestAsync$7(RaftServerProxy.java:436)
    at org.apache.ratis.server.impl.RaftServerProxy$$Lambda$827/1961332062.apply(Unknown Source)
    at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)
    at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)
    at org.apache.ratis.server.impl.RaftServerProxy.submitClientRequestAsync(RaftServerProxy.java:436)
    at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.submitRequest(XceiverServerRatis.java:611)
    at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CloseContainerCommandHandler.handle(CloseContainerCommandHandler.java:105)
    at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CommandDispatcher.handle(CommandDispatcher.java:103)
    at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$initCommandHandlerThread$3(DatanodeStateMachine.java:593)
    at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine$$Lambda$270/1788388131.run(Unknown Source)
    at java.lang.Thread.run(Thread.java:748)
{code}
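The direct cause is that the timeout never takes effect: in Ratis,
executeSubmitClientRequestAsync calls join(), which, as the stack above shows,
runs on the command processor thread inside thenCompose. The thread therefore
blocks before the outer CompletableFuture is returned, and the timeout applied
to that future is never reached.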
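The effect can be reproduced outside Ozone with plain CompletableFuture calls. The following is only a minimal sketch, not the Ratis or Ozone code: the class name JoinBlocksTimeoutSketch and the methods submitWithJoin, submitWithoutJoin and outcome are hypothetical stand-ins, and the delays are arbitrary. It assumes the caller applies its timeout with get(timeout, unit) on the returned future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class JoinBlocksTimeoutSketch {

  // Hypothetical stand-in for a submit path that join()s inside thenCompose.
  // The receiver is already complete, so the lambda runs on the calling
  // thread and this method does not return until `inner` completes.
  static CompletableFuture<String> submitWithJoin(CompletableFuture<String> inner) {
    return CompletableFuture.completedFuture("start")
        .thenCompose(ignored -> CompletableFuture.completedFuture(inner.join()));
  }

  // Same shape without the join(): the still-pending future is handed back
  // to the caller immediately.
  static CompletableFuture<String> submitWithoutJoin(CompletableFuture<String> inner) {
    return CompletableFuture.completedFuture("start").thenCompose(ignored -> inner);
  }

  // Submits a request whose reply arrives after replyDelayMs, then waits on the
  // returned future with a timeout. Returns "completed" or "timed out".
  static String outcome(boolean useJoin, long replyDelayMs, long timeoutMs) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    try {
      CompletableFuture<String> inner = new CompletableFuture<>();
      scheduler.schedule(() -> inner.complete("reply"), replyDelayMs, TimeUnit.MILLISECONDS);
      CompletableFuture<String> outer =
          useJoin ? submitWithJoin(inner) : submitWithoutJoin(inner);
      // The timeout only starts counting here, after the submit call returned.
      outer.get(timeoutMs, TimeUnit.MILLISECONDS);
      return "completed";
    } catch (TimeoutException e) {
      return "timed out";
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      scheduler.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("with join:    " + outcome(true, 500, 100));
    System.out.println("without join: " + outcome(false, 500, 100));
  }
}
```

With a 500 ms reply delay and a 100 ms timeout, the join variant should report "completed" after blocking the caller for the full delay, while the variant without join should report "timed out". In the join variant the caller cannot bound the wait at all, which matches the blocked command processor thread in the jstack.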