Janus Chow created HDDS-11291:
---------------------------------

             Summary: Datanode Command Handler blocked by executing ratis requests
                 Key: HDDS-11291
                 URL: https://issues.apache.org/jira/browse/HDDS-11291
             Project: Apache Ozone
          Issue Type: Bug
            Reporter: Janus Chow
            Assignee: Janus Chow


We hit the following issue: the Datanode command handler was executing a close 
container request, but because the timeout logic is incorrect, the request never 
timed out and blocked all subsequent commands from SCM.

The jstack output is as follows:
{code:java}
"Command processor thread" #215 daemon prio=5 os_prio=0 tid=0x00007fcef3262000 nid=0xa56 waiting on condition [0x00007fcf63f9d000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007fd4ab6dcd38> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
        at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1947)
        at org.apache.ratis.server.impl.RaftServerImpl.executeSubmitClientRequestAsync(RaftServerImpl.java:816)
        at org.apache.ratis.server.impl.RaftServerProxy.lambda$submitClientRequestAsync$7(RaftServerProxy.java:436)
        at org.apache.ratis.server.impl.RaftServerProxy$$Lambda$827/1961332062.apply(Unknown Source)
        at java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)
        at java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)
        at org.apache.ratis.server.impl.RaftServerProxy.submitClientRequestAsync(RaftServerProxy.java:436)
        at org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis.submitRequest(XceiverServerRatis.java:611)
        at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CloseContainerCommandHandler.handle(CloseContainerCommandHandler.java:105)
        at org.apache.hadoop.ozone.container.common.statemachine.commandhandler.CommandDispatcher.handle(CommandDispatcher.java:103)
        at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine.lambda$initCommandHandlerThread$3(DatanodeStateMachine.java:593)
        at org.apache.hadoop.ozone.container.common.statemachine.DatanodeStateMachine$$Lambda$270/1788388131.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:748)
{code}
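The direct cause is that the timeout never takes effect. In Ratis, 
executeSubmitClientRequestAsync calls join(), and as the stack trace shows, that 
join() runs on the command processor thread inside thenCompose. The thread 
therefore blocks before the outer CompletableFuture is returned, so the timeout 
applied to that outer future never gets a chance to fire.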
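
The blocking pattern described above can be reproduced outside Ratis and Ozone with plain CompletableFuture calls. The following is only a minimal sketch: every class and method name in it (JoinBlocksTimeoutDemo, slowReply, submitBlocking, submitNonBlocking, callerBlockedMillis) is hypothetical and is not taken from the Ratis or Ozone code. It assumes the caller applies its timeout via get(timeout, unit) on the returned future, and it is not meant as the actual fix.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical demo class; none of these names exist in Ratis or Ozone.
class JoinBlocksTimeoutDemo {

  // Stand-in for a slow request: completes on another thread after delayMillis.
  static CompletableFuture<String> slowReply(long delayMillis) {
    return CompletableFuture.supplyAsync(() -> {
      try {
        Thread.sleep(delayMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      return "reply";
    });
  }

  // Problematic shape: thenCompose on an already completed future runs the
  // lambda on the calling thread, and join() inside it blocks that thread
  // until the slow reply arrives. The future is only returned afterwards.
  static CompletableFuture<String> submitBlocking(long delayMillis) {
    return CompletableFuture.completedFuture("request")
        .thenCompose(req -> CompletableFuture.completedFuture(slowReply(delayMillis).join()));
  }

  // Non-blocking shape: the lambda hands back the pending future without
  // waiting, so the caller gets control back immediately.
  static CompletableFuture<String> submitNonBlocking(long delayMillis) {
    return CompletableFuture.completedFuture("request")
        .thenCompose(req -> slowReply(delayMillis));
  }

  // Measures how long the caller is stuck when it submits a request and then
  // waits on the returned future with a timeout.
  static long callerBlockedMillis(Supplier<CompletableFuture<String>> submit, long timeoutMillis) {
    long start = System.nanoTime();
    try {
      submit.get().get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // The timeout fired; this is the behavior the caller wants.
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
  }

  public static void main(String[] args) {
    // Slow reply takes 1000 ms, caller is only willing to wait 100 ms.
    long blocking = callerBlockedMillis(() -> submitBlocking(1000), 100);
    long nonBlocking = callerBlockedMillis(() -> submitNonBlocking(1000), 100);
    System.out.println("join() inside thenCompose: caller blocked ~" + blocking + " ms");
    System.out.println("no join():                 caller blocked ~" + nonBlocking + " ms");
  }
}
```

In the first case the caller should be held for about the full duration of the slow reply even though it asked for a much shorter timeout, because the wait happens before the future is returned. In the second case the timeout on the outer future can fire as intended.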


