[ 
https://issues.apache.org/jira/browse/HDDS-10643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashish Kumar updated HDDS-10643:
--------------------------------
    Description: 
When SCM stop is called, SCM first stops raft server which internally invokes 
state machine to 
[close|https://github.com/apache/ozone/blob/3467db1b1cc581a21caeb8648587fcbf35bbfdfa/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java#L444].
 

Since this call is from ratis, statemachine tries to 
[terminate|https://github.com/apache/ozone/blob/3467db1b1cc581a21caeb8648587fcbf35bbfdfa/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L1796]
 SCM by invoking 
[[System.exit|https://github.com/apache/ratis/blob/f40424422b692349b5496ee30e24335c8186093b/ratis-common/src/main/java/org/apache/ratis/util/ExitUtils.java#L138]|#L138]]call.
       

But System.exit waits for shutdown-hook to gets completed. Shutdown-hook is 
waiting for main thread(waiting for raft server close) to complete. So both 
threads are waiting for each other and leads to deadlock situation.

It depends on the timeout for shutdown-hook after that it interrupts the thread 
abruptly.

Currently after this SCM shuts down which is not graceful exit.

Below is the stack:

StateMachineUpdater:
"fa8d8513-6849-46f1-8088-e091f305efda@group-4D11F4CFC172-StateMachineUpdater" 
#58 daemon prio=5 os_prio=0 cpu=520.27ms elapsed=820.33s tid=0x00007fdc96449000 
nid=0x188160 waiting for monitor entry  [0x00007fdc60bd6000]
   java.lang.Thread.State: *BLOCKED* (on object monitor)
at java.lang.Shutdown.exit([email protected]/Shutdown.java:173)
 - waiting to lock <{*}0x000000072d15c608{*}> (a java.lang.Class for 
java.lang.Shutdown)
at java.lang.Runtime.exit([email protected]/Runtime.java:115)
at java.lang.System.exit([email protected]/System.java:1746)
at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:138)
at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:151)
at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:155)
at 
org.apache.hadoop.hdds.scm.server.StorageContainerManager.shutDown(StorageContainerManager.java:1758)
at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.close(SCMStateMachine.java:439)
at 
org.apache.ratis.server.impl.StateMachineUpdater.stop(StateMachineUpdater.java:134)
at 
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:189)
at java.lang.Thread.run([email protected]/Thread.java:834)
 
Shutdown-Hook:
"SIGTERM handler" #468 daemon prio=9 os_prio=0 cpu=1.24ms elapsed=20.21s 
tid=0x0000000000e99000 nid=0x18af9c in Object.wait()  [0x00007fdc4fd00000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait([email protected]/Native Method)
    - waiting on <no object reference available>
    at java.lang.Thread.join([email protected]/Thread.java:1305)
    - waiting to re-lock in wait() <0x000000071165a3f8> (a 
org.apache.hadoop.ozone.util.ShutdownHookManager$1)
    at java.lang.Thread.join([email protected]/Thread.java:1379)
    at 
java.lang.ApplicationShutdownHooks.runHooks([email protected]/ApplicationShutdownHooks.java:107)
    at 
java.lang.ApplicationShutdownHooks$1.run([email protected]/ApplicationShutdownHooks.java:46)
    at java.lang.Shutdown.runHooks([email protected]/Shutdown.java:130)
    at java.lang.Shutdown.exit([email protected]/Shutdown.java:174)
    - locked <{*}0x000000072d15c608{*}> (a java.lang.Class for 
java.lang.Shutdown)
    at java.lang.Terminator$1.handle([email protected]/Terminator.java:51)
    at 
sun.misc.Signal$SunMiscHandler.handle([email protected]/Signal.java:228)
    at 
org.apache.hadoop.hdds.utils.SignalLogger$Handler.handle(SignalLogger.java:62)
    at 
sun.misc.Signal$InternalMiscHandler.handle([email protected]/Signal.java:198)
    at jdk.internal.misc.Signal$1.run([email protected]/Signal.java:220)
    at java.lang.Thread.run([email protected]/Thread.java:834)
 
Raftserver stop:
"shutdown-hook-0" #469 daemon prio=5 os_prio=0 cpu=59.00ms elapsed=20.19s 
tid=0x00007fdc795a6800 nid=0x18afa0 waiting on condition  [0x00007fdc46579000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
    - parking to wait for  <0x000000072c928090> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at 
java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:234)
    at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos([email protected]/AbstractQueuedSynchronizer.java:2123)
    at 
java.util.concurrent.ThreadPoolExecutor.awaitTermination([email protected]/ThreadPoolExecutor.java:1454)
    at 
org.apache.ratis.util.ConcurrentUtils.{*}shutdownAndWait{*}(ConcurrentUtils.java:143)
    at 
org.apache.ratis.util.ConcurrentUtils.shutdownAndWait(ConcurrentUtils.java:135)
    at 
org.apache.ratis.server.impl.RaftServerProxy.lambda$close$6(RaftServerProxy.java:432)
    at 
org.apache.ratis.server.impl.RaftServerProxy$$Lambda$1122/0x0000000800a7e440.run(Unknown
 Source)
    at 
org.apache.ratis.util.LifeCycle.lambda$checkStateAndClose$4(LifeCycle.java:299)
    at 
org.apache.ratis.util.LifeCycle$$Lambda$892/0x0000000800960c40.get(Unknown 
Source)
    at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:319)
    at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:297)
    at 
org.apache.ratis.server.impl.RaftServerProxy.close(RaftServerProxy.java:415)
    at 
org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.stop(SCMRatisServerImpl.java:255)
    at 
org.apache.hadoop.hdds.scm.ha.SCMHAManagerImpl.stop(SCMHAManagerImpl.java:392)
    at 
org.apache.hadoop.hdds.scm.server.StorageContainerManager.stop(StorageContainerManager.java:1705)
    at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.lambda$start$0(StorageContainerManagerStarter.java:175)
    at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper$$Lambda$726/0x0000000800817840.run(Unknown
 Source)
    at 
java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:515)
    at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
    at java.lang.Thread.run([email protected]/Thread.java:834)

  was:
When SCM stop is called, SCM first stops raft server which internally invokes 
state machine to 
[close|https://github.com/apache/ozone/blob/3467db1b1cc581a21caeb8648587fcbf35bbfdfa/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java#L444].
 

Since this call is from ratis, statemachine tries to 
[terminate|https://github.com/apache/ozone/blob/3467db1b1cc581a21caeb8648587fcbf35bbfdfa/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L1796]
 SCM by invoking 
[System.exit|[https://github.com/apache/ratis/blob/f40424422b692349b5496ee30e24335c8186093b/ratis-common/src/main/java/org/apache/ratis/util/ExitUtils.java#L138]
 
|https://github.com/apache/ratis/blob/f40424422b692349b5496ee30e24335c8186093b/ratis-common/src/main/java/org/apache/ratis/util/ExitUtils.java#L138].]call.
        But System.exit waits for shutdown-hook to gets completed. 
Shutdown-hook is waiting for main thread(waiting for raft server close) to 
complete. So both threads are waiting for each other and leads to deadlock 
situation.

It depends on the timeout for shutdown-hook after that it interrupts the thread 
abruptly.

Currently after this SCM shuts down which is not graceful exit.

Below is the stack:

StateMachineUpdater:
"fa8d8513-6849-46f1-8088-e091f305efda@group-4D11F4CFC172-StateMachineUpdater" 
#58 daemon prio=5 os_prio=0 cpu=520.27ms elapsed=820.33s tid=0x00007fdc96449000 
nid=0x188160 waiting for monitor entry  [0x00007fdc60bd6000]
   java.lang.Thread.State: *BLOCKED* (on object monitor)
at java.lang.Shutdown.exit([email protected]/Shutdown.java:173)
- waiting to lock <{*}0x000000072d15c608{*}> (a java.lang.Class for 
java.lang.Shutdown)
at java.lang.Runtime.exit([email protected]/Runtime.java:115)
at java.lang.System.exit([email protected]/System.java:1746)
at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:138)
at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:151)
at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:155)
at 
org.apache.hadoop.hdds.scm.server.StorageContainerManager.shutDown(StorageContainerManager.java:1758)
at org.apache.hadoop.hdds.scm.ha.SCMStateMachine.close(SCMStateMachine.java:439)
at 
org.apache.ratis.server.impl.StateMachineUpdater.stop(StateMachineUpdater.java:134)
at 
org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:189)
at java.lang.Thread.run([email protected]/Thread.java:834)
 
Shutdown-Hook:
"SIGTERM handler" #468 daemon prio=9 os_prio=0 cpu=1.24ms elapsed=20.21s 
tid=0x0000000000e99000 nid=0x18af9c in Object.wait()  [0x00007fdc4fd00000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait([email protected]/Native Method)
    - waiting on <no object reference available>
    at java.lang.Thread.join([email protected]/Thread.java:1305)
    - waiting to re-lock in wait() <0x000000071165a3f8> (a 
org.apache.hadoop.ozone.util.ShutdownHookManager$1)
    at java.lang.Thread.join([email protected]/Thread.java:1379)
    at 
java.lang.ApplicationShutdownHooks.runHooks([email protected]/ApplicationShutdownHooks.java:107)
    at 
java.lang.ApplicationShutdownHooks$1.run([email protected]/ApplicationShutdownHooks.java:46)
    at java.lang.Shutdown.runHooks([email protected]/Shutdown.java:130)
    at java.lang.Shutdown.exit([email protected]/Shutdown.java:174)
    - locked <{*}0x000000072d15c608{*}> (a java.lang.Class for 
java.lang.Shutdown)
    at java.lang.Terminator$1.handle([email protected]/Terminator.java:51)
    at 
sun.misc.Signal$SunMiscHandler.handle([email protected]/Signal.java:228)
    at 
org.apache.hadoop.hdds.utils.SignalLogger$Handler.handle(SignalLogger.java:62)
    at 
sun.misc.Signal$InternalMiscHandler.handle([email protected]/Signal.java:198)
    at jdk.internal.misc.Signal$1.run([email protected]/Signal.java:220)
    at java.lang.Thread.run([email protected]/Thread.java:834)
 
Raftserver stop:
"shutdown-hook-0" #469 daemon prio=5 os_prio=0 cpu=59.00ms elapsed=20.19s 
tid=0x00007fdc795a6800 nid=0x18afa0 waiting on condition  [0x00007fdc46579000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
    - parking to wait for  <0x000000072c928090> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    at 
java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:234)
    at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos([email protected]/AbstractQueuedSynchronizer.java:2123)
    at 
java.util.concurrent.ThreadPoolExecutor.awaitTermination([email protected]/ThreadPoolExecutor.java:1454)
    at 
org.apache.ratis.util.ConcurrentUtils.{*}shutdownAndWait{*}(ConcurrentUtils.java:143)
    at 
org.apache.ratis.util.ConcurrentUtils.shutdownAndWait(ConcurrentUtils.java:135)
    at 
org.apache.ratis.server.impl.RaftServerProxy.lambda$close$6(RaftServerProxy.java:432)
    at 
org.apache.ratis.server.impl.RaftServerProxy$$Lambda$1122/0x0000000800a7e440.run(Unknown
 Source)
    at 
org.apache.ratis.util.LifeCycle.lambda$checkStateAndClose$4(LifeCycle.java:299)
    at 
org.apache.ratis.util.LifeCycle$$Lambda$892/0x0000000800960c40.get(Unknown 
Source)
    at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:319)
    at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:297)
    at 
org.apache.ratis.server.impl.RaftServerProxy.close(RaftServerProxy.java:415)
    at 
org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.stop(SCMRatisServerImpl.java:255)
    at 
org.apache.hadoop.hdds.scm.ha.SCMHAManagerImpl.stop(SCMHAManagerImpl.java:392)
    at 
org.apache.hadoop.hdds.scm.server.StorageContainerManager.stop(StorageContainerManager.java:1705)
    at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.lambda$start$0(StorageContainerManagerStarter.java:175)
    at 
org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper$$Lambda$726/0x0000000800817840.run(Unknown
 Source)
    at 
java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:515)
    at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
    at java.lang.Thread.run([email protected]/Thread.java:834)


> SCM fails to stop gracefully
> ----------------------------
>
>                 Key: HDDS-10643
>                 URL: https://issues.apache.org/jira/browse/HDDS-10643
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Ashish Kumar
>            Assignee: Ashish Kumar
>            Priority: Major
>
> When SCM stop is called, SCM first stops raft server which internally invokes 
> state machine to 
> [close|https://github.com/apache/ozone/blob/3467db1b1cc581a21caeb8648587fcbf35bbfdfa/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/ha/SCMStateMachine.java#L444].
>  
> Since this call is from ratis, statemachine tries to 
> [terminate|https://github.com/apache/ozone/blob/3467db1b1cc581a21caeb8648587fcbf35bbfdfa/hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/server/StorageContainerManager.java#L1796]
>  SCM by invoking 
> [[System.exit|https://github.com/apache/ratis/blob/f40424422b692349b5496ee30e24335c8186093b/ratis-common/src/main/java/org/apache/ratis/util/ExitUtils.java#L138]|#L138]]call.
>        
> But System.exit waits for shutdown-hook to gets completed. Shutdown-hook is 
> waiting for main thread(waiting for raft server close) to complete. So both 
> threads are waiting for each other and leads to deadlock situation.
> It depends on the timeout for shutdown-hook after that it interrupts the 
> thread abruptly.
> Currently after this SCM shuts down which is not graceful exit.
> Below is the stack:
> StateMachineUpdater:
> "fa8d8513-6849-46f1-8088-e091f305efda@group-4D11F4CFC172-StateMachineUpdater" 
> #58 daemon prio=5 os_prio=0 cpu=520.27ms elapsed=820.33s 
> tid=0x00007fdc96449000 nid=0x188160 waiting for monitor entry  
> [0x00007fdc60bd6000]
>    java.lang.Thread.State: *BLOCKED* (on object monitor)
> at java.lang.Shutdown.exit([email protected]/Shutdown.java:173)
>  - waiting to lock <{*}0x000000072d15c608{*}> (a java.lang.Class for 
> java.lang.Shutdown)
> at java.lang.Runtime.exit([email protected]/Runtime.java:115)
> at java.lang.System.exit([email protected]/System.java:1746)
> at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:138)
> at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:151)
> at org.apache.ratis.util.ExitUtils.terminate(ExitUtils.java:155)
> at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManager.shutDown(StorageContainerManager.java:1758)
> at 
> org.apache.hadoop.hdds.scm.ha.SCMStateMachine.close(SCMStateMachine.java:439)
> at 
> org.apache.ratis.server.impl.StateMachineUpdater.stop(StateMachineUpdater.java:134)
> at 
> org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:189)
> at java.lang.Thread.run([email protected]/Thread.java:834)
>  
> Shutdown-Hook:
> "SIGTERM handler" #468 daemon prio=9 os_prio=0 cpu=1.24ms elapsed=20.21s 
> tid=0x0000000000e99000 nid=0x18af9c in Object.wait()  [0x00007fdc4fd00000]
>    java.lang.Thread.State: WAITING (on object monitor)
>     at java.lang.Object.wait([email protected]/Native Method)
>     - waiting on <no object reference available>
>     at java.lang.Thread.join([email protected]/Thread.java:1305)
>     - waiting to re-lock in wait() <0x000000071165a3f8> (a 
> org.apache.hadoop.ozone.util.ShutdownHookManager$1)
>     at java.lang.Thread.join([email protected]/Thread.java:1379)
>     at 
> java.lang.ApplicationShutdownHooks.runHooks([email protected]/ApplicationShutdownHooks.java:107)
>     at 
> java.lang.ApplicationShutdownHooks$1.run([email protected]/ApplicationShutdownHooks.java:46)
>     at java.lang.Shutdown.runHooks([email protected]/Shutdown.java:130)
>     at java.lang.Shutdown.exit([email protected]/Shutdown.java:174)
>     - locked <{*}0x000000072d15c608{*}> (a java.lang.Class for 
> java.lang.Shutdown)
>     at java.lang.Terminator$1.handle([email protected]/Terminator.java:51)
>     at 
> sun.misc.Signal$SunMiscHandler.handle([email protected]/Signal.java:228)
>     at 
> org.apache.hadoop.hdds.utils.SignalLogger$Handler.handle(SignalLogger.java:62)
>     at 
> sun.misc.Signal$InternalMiscHandler.handle([email protected]/Signal.java:198)
>     at jdk.internal.misc.Signal$1.run([email protected]/Signal.java:220)
>     at java.lang.Thread.run([email protected]/Thread.java:834)
>  
> Raftserver stop:
> "shutdown-hook-0" #469 daemon prio=5 os_prio=0 cpu=59.00ms elapsed=20.19s 
> tid=0x00007fdc795a6800 nid=0x18afa0 waiting on condition  [0x00007fdc46579000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>     at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
>     - parking to wait for  <0x000000072c928090> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>     at 
> java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:234)
>     at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos([email protected]/AbstractQueuedSynchronizer.java:2123)
>     at 
> java.util.concurrent.ThreadPoolExecutor.awaitTermination([email protected]/ThreadPoolExecutor.java:1454)
>     at 
> org.apache.ratis.util.ConcurrentUtils.{*}shutdownAndWait{*}(ConcurrentUtils.java:143)
>     at 
> org.apache.ratis.util.ConcurrentUtils.shutdownAndWait(ConcurrentUtils.java:135)
>     at 
> org.apache.ratis.server.impl.RaftServerProxy.lambda$close$6(RaftServerProxy.java:432)
>     at 
> org.apache.ratis.server.impl.RaftServerProxy$$Lambda$1122/0x0000000800a7e440.run(Unknown
>  Source)
>     at 
> org.apache.ratis.util.LifeCycle.lambda$checkStateAndClose$4(LifeCycle.java:299)
>     at 
> org.apache.ratis.util.LifeCycle$$Lambda$892/0x0000000800960c40.get(Unknown 
> Source)
>     at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:319)
>     at org.apache.ratis.util.LifeCycle.checkStateAndClose(LifeCycle.java:297)
>     at 
> org.apache.ratis.server.impl.RaftServerProxy.close(RaftServerProxy.java:415)
>     at 
> org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.stop(SCMRatisServerImpl.java:255)
>     at 
> org.apache.hadoop.hdds.scm.ha.SCMHAManagerImpl.stop(SCMHAManagerImpl.java:392)
>     at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManager.stop(StorageContainerManager.java:1705)
>     at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper.lambda$start$0(StorageContainerManagerStarter.java:175)
>     at 
> org.apache.hadoop.hdds.scm.server.StorageContainerManagerStarter$SCMStarterHelper$$Lambda$726/0x0000000800817840.run(Unknown
>  Source)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:515)
>     at 
> java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
>     at java.lang.Thread.run([email protected]/Thread.java:834)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to