ivandika3 opened a new pull request, #7459:
URL: https://github.com/apache/ozone/pull/7459

   ## What changes were proposed in this pull request?
   
   We encountered an issue where the SCM is stuck after transfer leadership, 
causing SCM leader to be stuck and all client requests to timeout (including 
OMs).
   
   We saw that SCM is throwing TimeoutException in StateMachineUpdater (the 
thread in charge of applying Raft logs and completing user requests), causing 
the whole SCM request processing to be stuck.
   
   ```
   2024-11-18 15:54:50,182 
[daa4f362-f48d-4933-96b3-840a8739f1d9@group-C0BCE64451CF-StateMachineUpdater] 
ERROR org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception while 
cleaning up excess replicas.
   java.util.concurrent.TimeoutException
           at 
java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
           at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
           at 
org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.submitRequest(SCMRatisServerImpl.java:228)
           at 
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeRatis(SCMHAInvocationHandler.java:110)
           at 
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:67)
           at com.sun.proxy.$Proxy19.completeMove(Unknown Source)
           at 
org.apache.hadoop.hdds.scm.container.ReplicationManager.deleteSrcDnForMove(ReplicationManager.java:1610)
           at 
org.apache.hadoop.hdds.scm.container.ReplicationManager.lambda$onLeaderReadyAndOutOfSafeMode$36(ReplicationManager.java:2364)
           at 
java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)
           at 
org.apache.hadoop.hdds.scm.container.ReplicationManager.onLeaderReadyAndOutOfSafeMode(ReplicationManager.java:2342)
           at 
org.apache.hadoop.hdds.scm.container.ReplicationManager.notifyStatusChanged(ReplicationManager.java:2103)
           at 
org.apache.hadoop.hdds.scm.ha.SCMServiceManager.notifyStatusChanged(SCMServiceManager.java:53)
           at 
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.notifyTermIndexUpdated(SCMStateMachine.java:338)
           at 
org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1650)
           at 
org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:239)
           at 
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:182)
           at java.lang.Thread.run(Thread.java:748)
   2024-11-18 15:54:50,183 
[daa4f362-f48d-4933-96b3-840a8739f1d9@group-C0BCE64451CF-StateMachineUpdater] 
INFO org.apache.hadoop.hdds.scm.container.ReplicationManager: can not remove 
source replica after successfully replicated to target datanode
   2024-11-18 15:54:50,745 
[EventQueue-CloseContainerForCloseContainerEventHandler] ERROR 
org.apache.hadoop.hdds.server.events.SingleThreadExecutor: Error on execution 
message #8764007
   java.lang.reflect.UndeclaredThrowableException
           at com.sun.proxy.$Proxy16.updateContainerState(Unknown Source)
           at 
org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.updateContainerState(ContainerManagerImpl.java:332)
           at 
org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler.onMessage(CloseContainerEventHandler.java:82)
           at 
org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler.onMessage(CloseContainerEventHandler.java:51)
           at 
org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:85)
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
   Caused by: java.util.concurrent.TimeoutException
           at 
java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
           at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
           at 
org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.submitRequest(SCMRatisServerImpl.java:228)
           at 
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeRatis(SCMHAInvocationHandler.java:110)
           at 
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:67)
           ... 8 more
   2024-11-18 15:54:50,746 [EventQueue-DeleteBlockStatusForDeletedBlockLogImpl] 
WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager: 
Could not commit delete block transactions: []
   java.util.concurrent.TimeoutException
           at 
java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
           at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
           at 
org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.submitRequest(SCMRatisServerImpl.java:228)
           at 
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeRatis(SCMHAInvocationHandler.java:110)
           at 
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:67)
           at com.sun.proxy.$Proxy17.removeTransactionsFromDB(Unknown Source)
           at 
org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager.commitTransactions(SCMDeletedBlockTransactionStatusManager.java:527)
           at 
org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl.onMessage(DeletedBlockLogImpl.java:384)
           at 
org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl.onMessage(DeletedBlockLogImpl.java:73)
           at 
org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:85)
           at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748) 
   ```
   We found that the root cause this following call chains. 
   
   - StateMachine#notifyTermIndexUpdated (due to the transfer leadership)
     - ReplicationManager#notifyStatusChanged
       - LegacyReplicationManager#notifyStatusChanged 
         - LegacyReplicationManager#onLeaderReadyAndOutofSafeMode
           - LegacyReplicationManager#deleteSrcDnForMove
             - LegacyReplicationManager.MoveScheduler#completeMove (@Replicated 
annotation means that it will submit Ratis request)
               - SCMHAInvocationHandler#invokeRatisServer
                 - SCMRatiServerImpl#submitRequest
                   - RaftServerImpl#submitClientRequestAsync
   
   We should never send a Ratis request under the 
ReplicationManager#notifyStatusChanged since this will cause a deadlock with 
Ratis StateMachineUpdater. When ReplicationManager#notifyStatusChanged call the 
MoveScheduler#completeMove, it will send a Ratis request to the Raft server and 
wait until the log associated with it is applied by the StateMachineUpdater. 
However, since ReplicationManger#notifyStatusChanged is run under the 
StateMachineUpdater, it will block the StateMachineUpdater itself, meaning that 
the Raft log associated with request sent by MoveScheduler#completeMove will 
never be applied and there will be a deadlock. This will cause 
StateMachineUpdater to get stuck and most SCM client requests to timeout in the 
StateMachineUpdater.
   
   Currently, one possible fix might be to just remove the 
onLeaderReadyAndOutOfSafeMode implementation altogether and hopefully the 
inflight move will be handled up by the main ReplicationManager thread.
   
   Note: The issue should STILL happen after 
[HDDS-10690](https://issues.apache.org/jira/browse/HDDS-10690) since although 
StateMachine#notifyTermIndexUpdated will not trigger 
ReplicationManager#notifyStatusChanged, StateMachine#notifyLeaderReady will 
instead trigger the Replicationmanager#notifyStatusChanged. Since 
StateMachine#notifyLeaderReady is still called in the StateMachineUpdater 
through RaftServerImpl#applyLogToStateMachine.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-11750
   
   ## How was this patch tested?
   
   WIP.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to