[ 
https://issues.apache.org/jira/browse/STORM-3658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Li updated STORM-3658:
----------------------------
    Description: 
During worker startup, it starts many threads including executor threads, and 
then registers two shutdown hooks, first hook will invoke worker::shutdown, the 
second hook will sleep for 3 seconds before force halting the whole process.

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/daemon/worker/Worker.java#L141-L147

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/Executor.java#L253-L255

We have seen a case where a deadlock happened. Imaging there is a bolt in a 
topology consistently fails at prepare stage and throws RuntimeException. This 
thread will eventually invoke Runtime.getRuntime().exit. The code path is:

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L406-L407

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/error/ReportErrorAndDie.java#L41

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L510-L514

There are three scenarios here.

*Scenario 1*
If two shutdown hooks are registered before this bolt's prepare method is 
invoked, when this bolt throws RuntimeException, it eventually invokes 
Runtime.getRuntime().exit, which triggers two shutdown hooks to start. And then 
this bolt executor thread waits for these two shutdown hooks to finish.

https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L104-L111
 (we use this openJDK8u242 version)

However, what the first shutdown hooks does is to invoke worker::shutdown method
https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/daemon/worker/Worker.java#L467-L469
which will interrupt all executor threads and then wait for them to finish 

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L95-L100

However, this bolt executor thread ignores InterruptedException and continues 
to wait for the first hook to finish. Hence there is a dead lock between the 
first shutdown hook and the bolt executor thread. After 3 seconds, the second 
shutdown hook force halting the worker process. So in the log, you will see 
"Forcing Halt..."

*Scenario 2*
If the bolt executor thread invokes prepare method earlier than the main thread 
registers these two shutdown hooks, because the bolt executor thread already 
triggers shutdown, the main thread will receive an 
IllegalStateException("Shutdown in progress")

{code:java}
2020-06-18 11:57:29.159 o.a.s.u.Utils main [ERROR] Received error in thread 
main.. terminating server...
java.lang.Error: java.lang.IllegalStateException: Shutdown in progress
{code}

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L1011-L1018

https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L63-L67

Since there is no shutdown hook registered, Runtime.getRuntime.exit invoked by 
the bolt executor continues; in the meantime, the main thread also invokes 
Runtime.getRuntime.exit because of IllegalStateException. Eventually the 
process dies.

*Scenario 3*
This is the worse case. In this scenario, after the first shutdown hook is 
registered, and before the second hook is registered, the bolt executor thread 
invokes prepare method and throws a RuntimeException. 
https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L355-L356
 
In this case, we know have the deadlock between the first shutdown hook and the 
bolt executor thread (like in scenario 1). The main thread will also have 
IllegalStateException when it tries to register the second shutdown hook, so 
main thread also invokes Runtime.getRuntime.exit (like in scenario 2). But this 
time, since the bolt executor thread already obtained the shutdown lock 
(because it invokes Runtime.getRuntime.exit earlier), 
https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/Shutdown.java#L208-L214

the main thread has to wait for the bolt executor thread to release this 
Shutdown.class lock. However, there is a deadlock between bolt executor thread 
and the first shutdown hook(thread), the main thread, the bolt executor thread 
and the shutdown hook are all BLOCKED. 

And in this case, since the second shutdown hook (sleepKill) is not registered, 
and every other threads like heartbeat timers still work (not yet closed), no 
one (not worker itself, not nimbus or supervisor) will kill this worker. 

So this worker stays alive but it does not function. And since this executor 
bolt is blocked, it doesn't consume tuples from the receiveQ, so the receiveQ 
can be quickly fill up by upstreams, the credential refresh thread is also 
blocked because it won't give up until the credential tuple is published to the 
receiveQ.

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L68-L69

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/JCQueue.java#L300-L313

*How to produce the deadlock*
It can be produced by modifying the 
[WordCountTopology|https://github.com/apache/storm/blob/v2.1.0/examples/storm-starter/src/jvm/org/apache/storm/starter/WordCountTopology.java]
 to add 

{code:java}
@Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context) {
        throw new RuntimeException("Runtime exception at WordCountBolt 
prepare");
    }
{code}

in the WordCount bolt so everytime the prepare method is called, this bolt 
throws a RuntimeException. 


Optionally, add a delay in between registering two shutdown hooks can help 
reproduce scenario 3.

{code:java}
Runtime.getRuntime().addShutdownHook(wrappedFunc);
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            LOG.info("Sleep for 100ms between hooks", e);
        }
        LOG.info("Wait for 100ms");
        Runtime.getRuntime().addShutdownHook(sleepKill);
{code}


*Solution*
There are better ways to deal with this issue. But a simple way is to register 
shutdown hooks before any executor threads are started to avoid 
IllegalStateException; And in 
https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L98-L101
change it be a timed join so it won't wait indefinitely. This can prevent the 
deadlock from happening. 




  was:
During worker startup, it starts many threads including executor threads, and 
then registers two shutdown hooks, first hook will invoke worker::shutdown, the 
second hook will sleep for 3 seconds before force halting the whole process.

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/daemon/worker/Worker.java#L141-L147

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/Executor.java#L253-L255

We have seen a case where a deadlock happened. Imaging there is a bolt in a 
topology consistently fails at prepare stage and throws RuntimeException. This 
thread will eventually invoke Runtime.getRuntime().exit. The code path is:

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L406-L407

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/error/ReportErrorAndDie.java#L41

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L510-L514

There are three scenarios here.

*Scenario 1*
If two shutdown hooks are registered before this bolt's prepare method is 
invoked, when this bolt throws RuntimeException, it eventually invokes 
Runtime.getRuntime().exit, which triggers two shutdown hooks to start. And then 
this bolt executor thread waits for these two shutdown hooks to finish.

https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L104-L111
 (we use this openJDK8u242 version)

However, what the first shutdown hooks does is to invoke worker::shutdown method
https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/daemon/worker/Worker.java#L467-L469
which will interrupt all executor threads and then wait for them to finish 

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L95-L100

However, this bolt executor thread ignores InterruptedException and continues 
to wait for the first hook to finish. Hence there is a dead lock between the 
first shutdown hook and the bolt executor thread.

*Scenario 2*
If the bolt executor thread invokes prepare method earlier than the main thread 
registers these two shutdown hooks, because the bolt executor thread already 
triggers shutdown, the main thread will receive an 
IllegalStateException("Shutdown in progress")

{code:java}
2020-06-18 11:57:29.159 o.a.s.u.Utils main [ERROR] Received error in thread 
main.. terminating server...
java.lang.Error: java.lang.IllegalStateException: Shutdown in progress
{code}

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L1011-L1018

https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L63-L67

Since there is no shutdown hook registered, Runtime.getRuntime.exit invoked by 
the bolt executor continues; in the meantime, the main thread also invokes 
Runtime.getRuntime.exit because of IllegalStateException. Eventually the 
process dies

*Scenario 3*
This is the worse case. In this scenario, after the first shutdown hook is 
registered, and before the second hook is registered, the bolt executor thread 
invokes prepare method and throws a RuntimeException. 
https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L355-L356
 
In this case, we know have the deadlock between the first shutdown hook and the 
bolt executor thread (like in scenario 1). The main thread will also have 
IllegalStateException when it tries to register the second shutdown hook, so 
main thread also invokes Runtime.getRuntime.exit (like in scenario 2). But this 
time, since the bolt executor thread already obtained the shutdown lock 
(because it invokes Runtime.getRuntime.exit earlier), 
https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/Shutdown.java#L208-L214

the main thread has to wait for the bolt executor thread to release this 
Shutdown.class lock. However, there is a deadlock between bolt executor thread 
and the first shutdown hook(thread), the main thread, the bolt executor thread 
and the shutdown hook are all BLOCKED. 

And in this case, since the second shutdown hook (sleepKill) is not registered, 
and every other threads like heartbeat timers still work (not yet closed), no 
one (not worker itself, not nimbus or supervisor) will kill this worker. 

So this worker stays alive but it does not function. And since this executor 
bolt is blocked, it doesn't consume tuples from the receiveQ, so the receiveQ 
can be quickly fill up by upstreams, the credential refresh thread is also 
blocked because it won't give up until the credential tuple is published to the 
receiveQ.

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L68-L69

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/JCQueue.java#L300-L313

*How to produce the deadlock*
It can be produced by modifying the 
[WordCountTopology|https://github.com/apache/storm/blob/v2.1.0/examples/storm-starter/src/jvm/org/apache/storm/starter/WordCountTopology.java]
 to add 

{code:java}
@Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context) {
        throw new RuntimeException("Runtime exception at WordCountBolt 
prepare");
    }
{code}

in the WordCount bolt so everytime the prepare method is called, this bolt 
throws a RuntimeException. 


Optionally, add a delay in between registering two shutdown hooks can help 
reproduce scenario 3.

{code:java}
Runtime.getRuntime().addShutdownHook(wrappedFunc);
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            LOG.info("Sleep for 100ms between hooks", e);
        }
        LOG.info("Wait for 100ms");
        Runtime.getRuntime().addShutdownHook(sleepKill);
{code}


*Solution*
There are better ways to deal with this issue. But a simple way is to register 
shutdown hooks before any executor threads are started to avoid 
IllegalStateException; And in 
https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L98-L101
change it be a timed join so it won't wait indefinitely. This can prevent the 
deadlock from happening. 





> Problematic worker stays alive because of a deadlock caused by ShutdownHooks
> ----------------------------------------------------------------------------
>
>                 Key: STORM-3658
>                 URL: https://issues.apache.org/jira/browse/STORM-3658
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: Ethan Li
>            Assignee: Ethan Li
>            Priority: Major
>
> During worker startup, it starts many threads including executor threads, and 
> then registers two shutdown hooks, first hook will invoke worker::shutdown, 
> the second hook will sleep for 3 seconds before force halting the whole 
> process.
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/daemon/worker/Worker.java#L141-L147
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/Executor.java#L253-L255
> We have seen a case where a deadlock happened. Imaging there is a bolt in a 
> topology consistently fails at prepare stage and throws RuntimeException. 
> This thread will eventually invoke Runtime.getRuntime().exit. The code path 
> is:
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L406-L407
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/error/ReportErrorAndDie.java#L41
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L510-L514
> There are three scenarios here.
> *Scenario 1*
> If two shutdown hooks are registered before this bolt's prepare method is 
> invoked, when this bolt throws RuntimeException, it eventually invokes 
> Runtime.getRuntime().exit, which triggers two shutdown hooks to start. And 
> then this bolt executor thread waits for these two shutdown hooks to finish.
> https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L104-L111
>  (we use this openJDK8u242 version)
> However, what the first shutdown hooks does is to invoke worker::shutdown 
> method
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/daemon/worker/Worker.java#L467-L469
> which will interrupt all executor threads and then wait for them to finish 
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L95-L100
> However, this bolt executor thread ignores InterruptedException and continues 
> to wait for the first hook to finish. Hence there is a dead lock between the 
> first shutdown hook and the bolt executor thread. After 3 seconds, the second 
> shutdown hook force halting the worker process. So in the log, you will see 
> "Forcing Halt..."
> *Scenario 2*
> If the bolt executor thread invokes prepare method earlier than the main 
> thread registers these two shutdown hooks, because the bolt executor thread 
> already triggers shutdown, the main thread will receive an 
> IllegalStateException("Shutdown in progress")
> {code:java}
> 2020-06-18 11:57:29.159 o.a.s.u.Utils main [ERROR] Received error in thread 
> main.. terminating server...
> java.lang.Error: java.lang.IllegalStateException: Shutdown in progress
> {code}
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L1011-L1018
> https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L63-L67
> Since there is no shutdown hook registered, Runtime.getRuntime.exit invoked 
> by the bolt executor continues; in the meantime, the main thread also invokes 
> Runtime.getRuntime.exit because of IllegalStateException. Eventually the 
> process dies.
> *Scenario 3*
> This is the worse case. In this scenario, after the first shutdown hook is 
> registered, and before the second hook is registered, the bolt executor 
> thread invokes prepare method and throws a RuntimeException. 
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L355-L356
>  
> In this case, we know have the deadlock between the first shutdown hook and 
> the bolt executor thread (like in scenario 1). The main thread will also have 
> IllegalStateException when it tries to register the second shutdown hook, so 
> main thread also invokes Runtime.getRuntime.exit (like in scenario 2). But 
> this time, since the bolt executor thread already obtained the shutdown lock 
> (because it invokes Runtime.getRuntime.exit earlier), 
> https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/Shutdown.java#L208-L214
> the main thread has to wait for the bolt executor thread to release this 
> Shutdown.class lock. However, there is a deadlock between bolt executor 
> thread and the first shutdown hook(thread), the main thread, the bolt 
> executor thread and the shutdown hook are all BLOCKED. 
> And in this case, since the second shutdown hook (sleepKill) is not 
> registered, and every other threads like heartbeat timers still work (not yet 
> closed), no one (not worker itself, not nimbus or supervisor) will kill this 
> worker. 
> So this worker stays alive but it does not function. And since this executor 
> bolt is blocked, it doesn't consume tuples from the receiveQ, so the receiveQ 
> can be quickly fill up by upstreams, the credential refresh thread is also 
> blocked because it won't give up until the credential tuple is published to 
> the receiveQ.
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L68-L69
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/JCQueue.java#L300-L313
> *How to produce the deadlock*
> It can be produced by modifying the 
> [WordCountTopology|https://github.com/apache/storm/blob/v2.1.0/examples/storm-starter/src/jvm/org/apache/storm/starter/WordCountTopology.java]
>  to add 
> {code:java}
> @Override
>     public void prepare(Map<String, Object> topoConf, TopologyContext 
> context) {
>         throw new RuntimeException("Runtime exception at WordCountBolt 
> prepare");
>     }
> {code}
> in the WordCount bolt so everytime the prepare method is called, this bolt 
> throws a RuntimeException. 
> Optionally, add a delay in between registering two shutdown hooks can help 
> reproduce scenario 3.
> {code:java}
> Runtime.getRuntime().addShutdownHook(wrappedFunc);
>         try {
>             Thread.sleep(100);
>         } catch (InterruptedException e) {
>             LOG.info("Sleep for 100ms between hooks", e);
>         }
>         LOG.info("Wait for 100ms");
>         Runtime.getRuntime().addShutdownHook(sleepKill);
> {code}
> *Solution*
> There are better ways to deal with this issue. But a simple way is to 
> register shutdown hooks before any executor threads are started to avoid 
> IllegalStateException; And in 
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L98-L101
> change it be a timed join so it won't wait indefinitely. This can prevent the 
> deadlock from happening. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to