[
https://issues.apache.org/jira/browse/HDDS-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ashish Kumar updated HDDS-10643:
--------------------------------
Description:
When SCM stop is called, SCM first stops raft server which internally invokes
state machine to
[close|https://github.com/apache/ozone/blob/3467db1b1cc581a21caeb8648587fcbf35bbfdfa/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java#L444].
Since this call is from ratis, statemachine tries to
[terminate|https://github.com/apache/ozone/blob/3467db1b1cc581a21caeb8648587fcbf35bbfdfa/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L1796]
SCM by invoking [System.exit
|https://github.com/apache/ratis/blob/f40424422b692349b5496ee30e24335c8186093b/ratis-common/src/main/java/org/apache/ratis/util/ExitUtils.java#L138]call.
But System.exit waits for shutdown-hook to gets completed. Shutdown-hook is
waiting for main thread(waiting for raft server close) to complete. So both
threads are waiting for each other and leads to deadlock situation.
It depends on the timeout for shutdown-hook after that it interrupts the thread
abruptly.
Currently after this SCM shuts down which is not graceful exit.
Below is the stack:
StateMachineUpdater:
"fa8d8513-6849-46f1-8088-e091f305efda@group-4D11F4CFC172-StateMachineUpdater"
#58 daemon prio=5 os_prio=0 cpu=520.27ms elapsed=820.33s tid=0x00007fdc96449000
nid=0x188160 waiting for monitor entry [0x00007fdc60bd6000]
java.lang.Thread.State: *BLOCKED* (on object monitor)
at java.lang.Shutdown.exit([email protected]/Shutdown.java:173)
- waiting to lock <{*}0x000000072d15c608{*}> (a java.lang.Class for
java.lang.Shutdown)
at java.lang.Runtime.exit([email protected]/Runtime.java:115)
at java.lang.System.exit([email protected]/System.java:1746)
at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:138)
at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:151)
at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:155)
at
org.apache.hadoop.hdds.scm.server.StorageContainerManager.shutDown(StorageContainerManager.java:1758)
at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.close(SCMStateMachine.java:439)
at
org.apache.ratis.server.impl.StateMachineUpdater.stop(StateMachineUpdater.java:134)
at
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:189)
at java.lang.Thread.run([email protected]/Thread.java:834)
Shutdown-Hook:
"SIGTERM handler" #468 daemon prio=9 os_prio=0 cpu=1.24ms elapsed=20.21s
tid=0x0000000000e99000 nid=0x18af9c in Object.wait() [0x00007fdc4fd00000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait([email protected]/Native Method)
- waiting on <no object reference available>
at java.lang.Thread.join([email protected]/Thread.java:1305)
- waiting to re-lock in wait() <0x000000071165a3f8> (a
org.apache.hadoop.ozone.util.ShutdownHookManager$1)
at java.lang.Thread.join([email protected]/Thread.java:1379)
at
java.lang.ApplicationShutdownHooks.runHooks([email protected]/ApplicationShutdownHooks.java:107)
at
java.lang.ApplicationShutdownHooks$1.run([email protected]/ApplicationShutdownHooks.java:46)
at java.lang.Shutdown.runHooks([email protected]/Shutdown.java:130)
at java.lang.Shutdown.exit([email protected]/Shutdown.java:174)
- locked <{*}0x000000072d15c608{*}> (a java.lang.Class for
java.lang.Shutdown)
at java.lang.Terminator$1.handle([email protected]/Terminator.java:51)
at
sun.misc.Signal$SunMiscHandler.handle([email protected]/Signal.java:228)
at
org.apache.hadoop.hdds.utils.SignalLogger$Handler.handle(SignalLogger.java:62)
at
sun.misc.Signal$InternalMiscHandler.handle([email protected]/Signal.java:198)
at jdk.internal.misc.Signal$1.run([email protected]/Signal.java:220)
at java.lang.Thread.run([email protected]/Thread.java:834)
Raftserver stop:
"shutdown-hook-0" #469 daemon prio=5 os_prio=0 cpu=59.00ms elapsed=20.19s
tid=0x00007fdc795a6800 nid=0x18afa0 waiting on condition [0x00007fdc46579000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
- parking to wait for <0x000000072c928090> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at
java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:234)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos([email protected]/AbstractQueuedSynchronizer.java:2123)
at
java.util.concurrent.ThreadPoolExecutor.awaitTermination([email protected]/ThreadPoolExecutor.java:1454)
at
org.apache.ratis.util.ConcurrentUtils.{*}shutdownAndWait{*}(ConcurrentUtils.java:143)
at
org.apache.ratis.util.ConcurrentUtils.shutdownAndWait(ConcurrentUtils.java:135)
at
org.apache.ratis.server.impl.RaftServerProxy.lambda$close$6(RaftServerProxy.java:432)
at
org.apache.ratis.server.impl.RaftServerProxy$$Lambda$1122/0x0000000800a7e440.run(Unknown
Source)
at
org.apache.ratis.util.LifeCycle.lambda$checkStateAndClose$4(LifeCycle.java:299)
at
org.apache.ratis.util.LifeCycle$$Lambda$892/0x0000000800960c40.get(Unknown
Source)
at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:319)
at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:297)
at
org.apache.ratis.server.impl.RaftServerProxy.close(RaftServerProxy.java:415)
at
org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.stop(SCMRatisServerImpl.java:255)
at
org.apache.hadoop.hdds.scm.ha.SCMHAManagerImpl.stop(SCMHAManagerImpl.java:392)
at
org.apache.hadoop.hdds.scm.server.StorageContainerManager.stop(StorageContainerManager.java:1705)
at
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.lambda$start$0(StorageContainerManagerStarter.java:175)
at
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper$$Lambda$726/0x0000000800817840.run(Unknown
Source)
at
java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:515)
at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
at
java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
at java.lang.Thread.run([email protected]/Thread.java:834)
was:
When SCM stop is called, SCM first stops raft server which internally invokes
state machine to
[close|https://github.com/apache/ozone/blob/3467db1b1cc581a21caeb8648587fcbf35bbfdfa/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java#L444].
Since this call is from ratis, statemachine tries to
[terminate|https://github.com/apache/ozone/blob/3467db1b1cc581a21caeb8648587fcbf35bbfdfa/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L1796]
SCM by invoking [System.exit
|https://github.com/apache/ratis/blob/f40424422b692349b5496ee30e24335c8186093b/ratis-common/src/main/java/org/apache/ratis/util/ExitUtils.java#L138]]call.
But System.exit waits for shutdown-hook to gets completed. Shutdown-hook is
waiting for main thread(waiting for raft server close) to complete. So both
threads are waiting for each other and leads to deadlock situation.
It depends on the timeout for shutdown-hook after that it interrupts the thread
abruptly.
Currently after this SCM shuts down which is not graceful exit.
Below is the stack:
StateMachineUpdater:
"fa8d8513-6849-46f1-8088-e091f305efda@group-4D11F4CFC172-StateMachineUpdater"
#58 daemon prio=5 os_prio=0 cpu=520.27ms elapsed=820.33s tid=0x00007fdc96449000
nid=0x188160 waiting for monitor entry [0x00007fdc60bd6000]
java.lang.Thread.State: *BLOCKED* (on object monitor)
at java.lang.Shutdown.exit([email protected]/Shutdown.java:173)
- waiting to lock <{*}0x000000072d15c608{*}> (a java.lang.Class for
java.lang.Shutdown)
at java.lang.Runtime.exit([email protected]/Runtime.java:115)
at java.lang.System.exit([email protected]/System.java:1746)
at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:138)
at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:151)
at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:155)
at
org.apache.hadoop.hdds.scm.server.StorageContainerManager.shutDown(StorageContainerManager.java:1758)
at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.close(SCMStateMachine.java:439)
at
org.apache.ratis.server.impl.StateMachineUpdater.stop(StateMachineUpdater.java:134)
at
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:189)
at java.lang.Thread.run([email protected]/Thread.java:834)
Shutdown-Hook:
"SIGTERM handler" #468 daemon prio=9 os_prio=0 cpu=1.24ms elapsed=20.21s
tid=0x0000000000e99000 nid=0x18af9c in Object.wait() [0x00007fdc4fd00000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait([email protected]/Native Method)
- waiting on <no object reference available>
at java.lang.Thread.join([email protected]/Thread.java:1305)
- waiting to re-lock in wait() <0x000000071165a3f8> (a
org.apache.hadoop.ozone.util.ShutdownHookManager$1)
at java.lang.Thread.join([email protected]/Thread.java:1379)
at
java.lang.ApplicationShutdownHooks.runHooks([email protected]/ApplicationShutdownHooks.java:107)
at
java.lang.ApplicationShutdownHooks$1.run([email protected]/ApplicationShutdownHooks.java:46)
at java.lang.Shutdown.runHooks([email protected]/Shutdown.java:130)
at java.lang.Shutdown.exit([email protected]/Shutdown.java:174)
- locked <{*}0x000000072d15c608{*}> (a java.lang.Class for
java.lang.Shutdown)
at java.lang.Terminator$1.handle([email protected]/Terminator.java:51)
at
sun.misc.Signal$SunMiscHandler.handle([email protected]/Signal.java:228)
at
org.apache.hadoop.hdds.utils.SignalLogger$Handler.handle(SignalLogger.java:62)
at
sun.misc.Signal$InternalMiscHandler.handle([email protected]/Signal.java:198)
at jdk.internal.misc.Signal$1.run([email protected]/Signal.java:220)
at java.lang.Thread.run([email protected]/Thread.java:834)
Raftserver stop:
"shutdown-hook-0" #469 daemon prio=5 os_prio=0 cpu=59.00ms elapsed=20.19s
tid=0x00007fdc795a6800 nid=0x18afa0 waiting on condition [0x00007fdc46579000]
java.lang.Thread.State: TIMED_WAITING (parking)
at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
- parking to wait for <0x000000072c928090> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at
java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:234)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos([email protected]/AbstractQueuedSynchronizer.java:2123)
at
java.util.concurrent.ThreadPoolExecutor.awaitTermination([email protected]/ThreadPoolExecutor.java:1454)
at
org.apache.ratis.util.ConcurrentUtils.{*}shutdownAndWait{*}(ConcurrentUtils.java:143)
at
org.apache.ratis.util.ConcurrentUtils.shutdownAndWait(ConcurrentUtils.java:135)
at
org.apache.ratis.server.impl.RaftServerProxy.lambda$close$6(RaftServerProxy.java:432)
at
org.apache.ratis.server.impl.RaftServerProxy$$Lambda$1122/0x0000000800a7e440.run(Unknown
Source)
at
org.apache.ratis.util.LifeCycle.lambda$checkStateAndClose$4(LifeCycle.java:299)
at
org.apache.ratis.util.LifeCycle$$Lambda$892/0x0000000800960c40.get(Unknown
Source)
at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:319)
at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:297)
at
org.apache.ratis.server.impl.RaftServerProxy.close(RaftServerProxy.java:415)
at
org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.stop(SCMRatisServerImpl.java:255)
at
org.apache.hadoop.hdds.scm.ha.SCMHAManagerImpl.stop(SCMHAManagerImpl.java:392)
at
org.apache.hadoop.hdds.scm.server.StorageContainerManager.stop(StorageContainerManager.java:1705)
at
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.lambda$start$0(StorageContainerManagerStarter.java:175)
at
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper$$Lambda$726/0x0000000800817840.run(Unknown
Source)
at
java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:515)
at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
at
java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
at java.lang.Thread.run([email protected]/Thread.java:834)
> SCM fails to stop gracefully
> ----------------------------
>
> Key: HDDS-10643
> URL: https://issues.apache.org/jira/browse/HDDS-10643
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Ashish Kumar
> Assignee: Ashish Kumar
> Priority: Major
>
> When SCM stop is called, SCM first stops raft server which internally invokes
> state machine to
> [close|https://github.com/apache/ozone/blob/3467db1b1cc581a21caeb8648587fcbf35bbfdfa/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java#L444].
>
> Since this call is from ratis, statemachine tries to
> [terminate|https://github.com/apache/ozone/blob/3467db1b1cc581a21caeb8648587fcbf35bbfdfa/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L1796]
> SCM by invoking [System.exit
> |https://github.com/apache/ratis/blob/f40424422b692349b5496ee30e24335c8186093b/ratis-common/src/main/java/org/apache/ratis/util/ExitUtils.java#L138]call.
>
> But System.exit waits for shutdown-hook to gets completed. Shutdown-hook is
> waiting for main thread(waiting for raft server close) to complete. So both
> threads are waiting for each other and leads to deadlock situation.
> It depends on the timeout for shutdown-hook after that it interrupts the
> thread abruptly.
> Currently after this SCM shuts down which is not graceful exit.
> Below is the stack:
> StateMachineUpdater:
> "fa8d8513-6849-46f1-8088-e091f305efda@group-4D11F4CFC172-StateMachineUpdater"
> #58 daemon prio=5 os_prio=0 cpu=520.27ms elapsed=820.33s
> tid=0x00007fdc96449000 nid=0x188160 waiting for monitor entry
> [0x00007fdc60bd6000]
> java.lang.Thread.State: *BLOCKED* (on object monitor)
> at java.lang.Shutdown.exit([email protected]/Shutdown.java:173)
> - waiting to lock <{*}0x000000072d15c608{*}> (a java.lang.Class for
> java.lang.Shutdown)
> at java.lang.Runtime.exit([email protected]/Runtime.java:115)
> at java.lang.System.exit([email protected]/System.java:1746)
> at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:138)
> at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:151)
> at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:155)
> at
> org.apache.hadoop.hdds.scm.server.StorageContainerManager.shutDown(StorageContainerManager.java:1758)
> at
> org.apache.hadoop.hdds.scm.ha.SCMStateMachine.close(SCMStateMachine.java:439)
> at
> org.apache.ratis.server.impl.StateMachineUpdater.stop(StateMachineUpdater.java:134)
> at
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:189)
> at java.lang.Thread.run([email protected]/Thread.java:834)
>
> Shutdown-Hook:
> "SIGTERM handler" #468 daemon prio=9 os_prio=0 cpu=1.24ms elapsed=20.21s
> tid=0x0000000000e99000 nid=0x18af9c in Object.wait() [0x00007fdc4fd00000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait([email protected]/Native Method)
> - waiting on <no object reference available>
> at java.lang.Thread.join([email protected]/Thread.java:1305)
> - waiting to re-lock in wait() <0x000000071165a3f8> (a
> org.apache.hadoop.ozone.util.ShutdownHookManager$1)
> at java.lang.Thread.join([email protected]/Thread.java:1379)
> at
> java.lang.ApplicationShutdownHooks.runHooks([email protected]/ApplicationShutdownHooks.java:107)
> at
> java.lang.ApplicationShutdownHooks$1.run([email protected]/ApplicationShutdownHooks.java:46)
> at java.lang.Shutdown.runHooks([email protected]/Shutdown.java:130)
> at java.lang.Shutdown.exit([email protected]/Shutdown.java:174)
> - locked <{*}0x000000072d15c608{*}> (a java.lang.Class for
> java.lang.Shutdown)
> at java.lang.Terminator$1.handle([email protected]/Terminator.java:51)
> at
> sun.misc.Signal$SunMiscHandler.handle([email protected]/Signal.java:228)
> at
> org.apache.hadoop.hdds.utils.SignalLogger$Handler.handle(SignalLogger.java:62)
> at
> sun.misc.Signal$InternalMiscHandler.handle([email protected]/Signal.java:198)
> at jdk.internal.misc.Signal$1.run([email protected]/Signal.java:220)
> at java.lang.Thread.run([email protected]/Thread.java:834)
>
> Raftserver stop:
> "shutdown-hook-0" #469 daemon prio=5 os_prio=0 cpu=59.00ms elapsed=20.19s
> tid=0x00007fdc795a6800 nid=0x18afa0 waiting on condition [0x00007fdc46579000]
> java.lang.Thread.State: TIMED_WAITING (parking)
> at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
> - parking to wait for <0x000000072c928090> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at
> java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:234)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos([email protected]/AbstractQueuedSynchronizer.java:2123)
> at
> java.util.concurrent.ThreadPoolExecutor.awaitTermination([email protected]/ThreadPoolExecutor.java:1454)
> at
> org.apache.ratis.util.ConcurrentUtils.{*}shutdownAndWait{*}(ConcurrentUtils.java:143)
> at
> org.apache.ratis.util.ConcurrentUtils.shutdownAndWait(ConcurrentUtils.java:135)
> at
> org.apache.ratis.server.impl.RaftServerProxy.lambda$close$6(RaftServerProxy.java:432)
> at
> org.apache.ratis.server.impl.RaftServerProxy$$Lambda$1122/0x0000000800a7e440.run(Unknown
> Source)
> at
> org.apache.ratis.util.LifeCycle.lambda$checkStateAndClose$4(LifeCycle.java:299)
> at
> org.apache.ratis.util.LifeCycle$$Lambda$892/0x0000000800960c40.get(Unknown
> Source)
> at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:319)
> at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:297)
> at
> org.apache.ratis.server.impl.RaftServerProxy.close(RaftServerProxy.java:415)
> at
> org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.stop(SCMRatisServerImpl.java:255)
> at
> org.apache.hadoop.hdds.scm.ha.SCMHAManagerImpl.stop(SCMHAManagerImpl.java:392)
> at
> org.apache.hadoop.hdds.scm.server.StorageContainerManager.stop(StorageContainerManager.java:1705)
> at
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.lambda$start$0(StorageContainerManagerStarter.java:175)
> at
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper$$Lambda$726/0x0000000800817840.run(Unknown
> Source)
> at
> java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:515)
> at
> java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
> at java.lang.Thread.run([email protected]/Thread.java:834)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]