ivandika3 opened a new pull request, #7459:
URL: https://github.com/apache/ozone/pull/7459
## What changes were proposed in this pull request?
We encountered an issue where the SCM gets stuck after a leadership transfer,
causing all client requests (including those from OMs) to time out.
We saw that the SCM throws a TimeoutException in the Ratis StateMachineUpdater
(the thread in charge of applying Raft logs and completing user requests),
stalling all SCM request processing.
```
2024-11-18 15:54:50,182
[daa4f362-f48d-4933-96b3-840a8739f1d9@group-C0BCE64451CF-StateMachineUpdater]
ERROR org.apache.hadoop.hdds.scm.container.ReplicationManager: Exception while
cleaning up excess replicas.
java.util.concurrent.TimeoutException
at
java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at
org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.submitRequest(SCMRatisServerImpl.java:228)
at
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeRatis(SCMHAInvocationHandler.java:110)
at
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:67)
at com.sun.proxy.$Proxy19.completeMove(Unknown Source)
at
org.apache.hadoop.hdds.scm.container.ReplicationManager.deleteSrcDnForMove(ReplicationManager.java:1610)
at
org.apache.hadoop.hdds.scm.container.ReplicationManager.lambda$onLeaderReadyAndOutOfSafeMode$36(ReplicationManager.java:2364)
at
java.util.concurrent.ConcurrentHashMap.forEach(ConcurrentHashMap.java:1597)
at
org.apache.hadoop.hdds.scm.container.ReplicationManager.onLeaderReadyAndOutOfSafeMode(ReplicationManager.java:2342)
at
org.apache.hadoop.hdds.scm.container.ReplicationManager.notifyStatusChanged(ReplicationManager.java:2103)
at
org.apache.hadoop.hdds.scm.ha.SCMServiceManager.notifyStatusChanged(SCMServiceManager.java:53)
at
org.apache.hadoop.hdds.scm.ha.SCMStateMachine.notifyTermIndexUpdated(SCMStateMachine.java:338)
at
org.apache.ratis.server.impl.RaftServerImpl.applyLogToStateMachine(RaftServerImpl.java:1650)
at
org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:239)
at
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:182)
at java.lang.Thread.run(Thread.java:748)
2024-11-18 15:54:50,183
[daa4f362-f48d-4933-96b3-840a8739f1d9@group-C0BCE64451CF-StateMachineUpdater]
INFO org.apache.hadoop.hdds.scm.container.ReplicationManager: can not remove
source replica after successfully replicated to target datanode
2024-11-18 15:54:50,745
[EventQueue-CloseContainerForCloseContainerEventHandler] ERROR
org.apache.hadoop.hdds.server.events.SingleThreadExecutor: Error on execution
message #8764007
java.lang.reflect.UndeclaredThrowableException
at com.sun.proxy.$Proxy16.updateContainerState(Unknown Source)
at
org.apache.hadoop.hdds.scm.container.ContainerManagerImpl.updateContainerState(ContainerManagerImpl.java:332)
at
org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler.onMessage(CloseContainerEventHandler.java:82)
at
org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler.onMessage(CloseContainerEventHandler.java:51)
at
org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:85)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException
at
java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at
org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.submitRequest(SCMRatisServerImpl.java:228)
at
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeRatis(SCMHAInvocationHandler.java:110)
at
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:67)
... 8 more
2024-11-18 15:54:50,746 [EventQueue-DeleteBlockStatusForDeletedBlockLogImpl]
WARN org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager:
Could not commit delete block transactions: []
java.util.concurrent.TimeoutException
at
java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1784)
at
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
at
org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.submitRequest(SCMRatisServerImpl.java:228)
at
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invokeRatis(SCMHAInvocationHandler.java:110)
at
org.apache.hadoop.hdds.scm.ha.SCMHAInvocationHandler.invoke(SCMHAInvocationHandler.java:67)
at com.sun.proxy.$Proxy17.removeTransactionsFromDB(Unknown Source)
at
org.apache.hadoop.hdds.scm.block.SCMDeletedBlockTransactionStatusManager.commitTransactions(SCMDeletedBlockTransactionStatusManager.java:527)
at
org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl.onMessage(DeletedBlockLogImpl.java:384)
at
org.apache.hadoop.hdds.scm.block.DeletedBlockLogImpl.onMessage(DeletedBlockLogImpl.java:73)
at
org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:85)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
We found that the root cause is the following call chain:
- StateMachine#notifyTermIndexUpdated (due to the leadership transfer)
- ReplicationManager#notifyStatusChanged
- LegacyReplicationManager#notifyStatusChanged
- LegacyReplicationManager#onLeaderReadyAndOutOfSafeMode
- LegacyReplicationManager#deleteSrcDnForMove
- LegacyReplicationManager.MoveScheduler#completeMove (the @Replicated
annotation means it submits a Ratis request)
- SCMHAInvocationHandler#invokeRatis
- SCMRatisServerImpl#submitRequest
- RaftServerImpl#submitClientRequestAsync
We should never send a Ratis request from within
ReplicationManager#notifyStatusChanged, since this causes a deadlock with the
Ratis StateMachineUpdater. When ReplicationManager#notifyStatusChanged calls
MoveScheduler#completeMove, it sends a Ratis request to the Raft server and
waits until the log entry associated with it is applied by the
StateMachineUpdater. However, since ReplicationManager#notifyStatusChanged
itself runs on the StateMachineUpdater thread, it blocks the
StateMachineUpdater, so the Raft log entry for the request sent by
MoveScheduler#completeMove can never be applied: a deadlock. This leaves the
StateMachineUpdater stuck and causes most SCM client requests to time out.
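The deadlock pattern can be reduced to a minimal, self-contained Java sketch
(hypothetical names, not Ozone code): a single-threaded executor stands in for
the StateMachineUpdater, and a task running on it blocks on a future that only
its own thread could ever complete, so the wait times out exactly like the
stack traces above.

```java
import java.util.concurrent.*;

public class SelfDeadlockDemo {
    // Returns "timeout" if the nested request self-deadlocks,
    // "applied" if it completes.
    static String demo() throws Exception {
        // Stand-in for the Ratis StateMachineUpdater: a single thread
        // that "applies" log entries one at a time.
        ExecutorService updater = Executors.newSingleThreadExecutor();
        try {
            // Simulates notifyStatusChanged running ON the updater thread
            // and submitting a new "Ratis request" whose completion also
            // requires the updater thread.
            return updater.submit(() -> {
                Future<?> ratisRequest =
                        updater.submit(() -> { /* apply log entry */ });
                try {
                    // Blocks the only thread that could run ratisRequest.
                    ratisRequest.get(1, TimeUnit.SECONDS);
                    return "applied";
                } catch (TimeoutException e) {
                    return "timeout";
                }
            }).get();
        } finally {
            updater.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints "timeout"
    }
}
```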
Currently, one possible fix might be to remove the
onLeaderReadyAndOutOfSafeMode implementation altogether and let the in-flight
moves be picked up by the main ReplicationManager thread.
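As a general illustration of why moving the work off the updater thread avoids
the deadlock (a hypothetical sketch, not the proposed patch): if the blocking
submission is handed off to a separate thread, the notification handler returns
immediately and the updater thread stays free to apply the log entry.

```java
import java.util.concurrent.*;

public class DeferredSubmitDemo {
    static String demo() throws Exception {
        // Stand-in for the StateMachineUpdater thread.
        ExecutorService updater = Executors.newSingleThreadExecutor();
        // Separate thread for the blocking request submission.
        ExecutorService offload = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<String> result = new CompletableFuture<>();

            // The notification handler only schedules the request and
            // returns, so the updater thread is free again right away.
            updater.submit(() -> offload.submit(() -> {
                try {
                    Future<?> ratisRequest =
                            updater.submit(() -> { /* apply log entry */ });
                    // Succeeds: the updater thread is no longer blocked.
                    ratisRequest.get(1, TimeUnit.SECONDS);
                    result.complete("applied");
                } catch (Exception e) {
                    result.complete("timeout");
                }
            })).get();

            return result.get(5, TimeUnit.SECONDS);
        } finally {
            updater.shutdownNow();
            offload.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints "applied"
    }
}
```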
Note: The issue should STILL happen after
[HDDS-10690](https://issues.apache.org/jira/browse/HDDS-10690): although
StateMachine#notifyTermIndexUpdated will no longer trigger
ReplicationManager#notifyStatusChanged, StateMachine#notifyLeaderReady will
trigger ReplicationManager#notifyStatusChanged instead, and
StateMachine#notifyLeaderReady is still called on the StateMachineUpdater
thread through RaftServerImpl#applyLogToStateMachine.
## What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11750
## How was this patch tested?
WIP.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]