Ethan Li created STORM-3658:
-------------------------------

             Summary: Deadlock caused by ShutdownHooks
                 Key: STORM-3658
                 URL: https://issues.apache.org/jira/browse/STORM-3658
             Project: Apache Storm
          Issue Type: Bug
            Reporter: Ethan Li
            Assignee: Ethan Li


During worker startup, it starts many threads including executor threads, and 
then registers two shutdown hooks, first hook will invoke worker::shutdown, the 
second hook will sleep for 3 seconds before force halting the whole process.

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/daemon/worker/Worker.java#L141-L147

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/Executor.java#L253-L255

We have seen a case where a deadlock happened. Imaging there is a bolt in a 
topology consistently fails at prepare stage and throws RuntimeException. This 
thread will eventually invoke Runtime.getRuntime().exit. The code path is:

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L406-L407

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/error/ReportErrorAndDie.java#L41

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L510-L514

There are three scenarios here.

*Scenario 1*
If two shutdown hooks are registered before this bolt's prepare method is 
invoked, when this bolt throws RuntimeException, it eventually invokes 
Runtime.getRuntime().exit, which triggers two shutdown hooks to start. And then 
this bolt executor thread waits for these two shutdown hooks to finish.

https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L104-L111
 (we use this openJDK8u242 version)

However, what the first shutdown hooks does is to invoke worker::shutdown method
https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/daemon/worker/Worker.java#L467-L469
which will interrupt all executor threads and then wait for them to finish 

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L95-L100

However, this bolt executor thread ignores InterruptedException and continues 
to wait for the first hook to finish. Hence there is a dead lock between the 
first shutdown hook and the bolt executor thread.

*Scenario 2*
If the bolt executor thread invokes prepare method earlier than the main thread 
registers these two shutdown hooks, because the bolt executor thread already 
triggers shutdown, the main thread will receive an 
IllegalStateException("Shutdown in progress")

{code:java}
2020-06-18 11:57:29.159 o.a.s.u.Utils main [ERROR] Received error in thread 
main.. terminating server...
java.lang.Error: java.lang.IllegalStateException: Shutdown in progress
{code}

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L1011-L1018

https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L63-L67

Since there is no shutdown hook registered, Runtime.getRuntime.exit invoked by 
the bolt executor continues; in the meantime, the main thread also invokes 
Runtime.getRuntime.exit because of IllegalStateException. Eventually the 
process dies

*Scenario 3*
This is the worse case. In this scenario, after the first shutdown hook is 
registered, and before the second hook is registered, the bolt executor thread 
invokes prepare method and throws a RuntimeException. 
https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L355-L356
 
In this case, we know have the deadlock between the first shutdown hook and the 
bolt executor thread (like in scenario 1). The main thread will also have 
IllegalStateException when it tries to register the second shutdown hook, so 
main thread also invokes Runtime.getRuntime.exit (like in scenario 2). But this 
time, since the bolt executor thread already obtained the shutdown lock 
(because it invokes Runtime.getRuntime.exit earlier), 
https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/Shutdown.java#L208-L214

the main thread has to wait for the bolt executor thread to release this 
Shutdown.class lock. However, there is a deadlock between bolt executor thread 
and the first shutdown hook(thread), the main thread, the bolt executor thread 
and the shutdown hook are all BLOCKED. 

And in this case, since the second shutdown hook (sleepKill) is not registered, 
and every other threads like heartbeat timers still work (not yet closed), no 
one (not worker itself, not nimbus or supervisor) will kill this worker. 

So this worker stays alive but it does not function. And since this executor 
bolt is blocked, it doesn't consume tuples from the receiveQ, so the receiveQ 
can be quickly fill up by upstreams, the credential refresh thread is also 
blocked because it won't give up until the credential tuple is published to the 
receiveQ.

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L68-L69

https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/JCQueue.java#L300-L313

*How to produce the deadlock*
It can be produced by modifying the 
[WordCountTopology|https://github.com/apache/storm/blob/v2.1.0/examples/storm-starter/src/jvm/org/apache/storm/starter/WordCountTopology.java]
 to add 

{code:java}
@Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context) {
        throw new RuntimeException("Runtime exception at WordCountBolt 
prepare");
    }
{code}

in the WordCount bolt so everytime the prepare method is called, this bolt 
throws a RuntimeException. 


Optionally, add a delay in between registering two shutdown hooks can help 
reproduce scenario 3.

{code:java}
Runtime.getRuntime().addShutdownHook(wrappedFunc);
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            LOG.info("Sleep for 100ms between hooks", e);
        }
        LOG.info("Wait for 100ms");
        Runtime.getRuntime().addShutdownHook(sleepKill);
{code}


*Solution*
There are better ways to deal with this issue. But a simple way is to register 
shutdown hooks before any executor threads are started to avoid 
IllegalStateException; And in 
https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L98-L101
change it be a timed join so it won't wait indefinitely. This can prevent the 
deadlock from happening. 






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to