[ 
https://issues.apache.org/jira/browse/STORM-3658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Li updated STORM-3658:
----------------------------
    Summary: Problematic worker stays alive because of a deadlock caused by 
ShutdownHooks  (was: Deadlock caused by ShutdownHooks)

> Problematic worker stays alive because of a deadlock caused by ShutdownHooks
> ----------------------------------------------------------------------------
>
>                 Key: STORM-3658
>                 URL: https://issues.apache.org/jira/browse/STORM-3658
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: Ethan Li
>            Assignee: Ethan Li
>            Priority: Major
>
> During worker startup, it starts many threads including executor threads, and 
> then registers two shutdown hooks, first hook will invoke worker::shutdown, 
> the second hook will sleep for 3 seconds before force halting the whole 
> process.
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/daemon/worker/Worker.java#L141-L147
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/Executor.java#L253-L255
> We have seen a case where a deadlock happened. Imaging there is a bolt in a 
> topology consistently fails at prepare stage and throws RuntimeException. 
> This thread will eventually invoke Runtime.getRuntime().exit. The code path 
> is:
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L406-L407
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/error/ReportErrorAndDie.java#L41
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L510-L514
> There are three scenarios here.
> *Scenario 1*
> If two shutdown hooks are registered before this bolt's prepare method is 
> invoked, when this bolt throws RuntimeException, it eventually invokes 
> Runtime.getRuntime().exit, which triggers two shutdown hooks to start. And 
> then this bolt executor thread waits for these two shutdown hooks to finish.
> https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L104-L111
>  (we use this openJDK8u242 version)
> However, what the first shutdown hooks does is to invoke worker::shutdown 
> method
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/daemon/worker/Worker.java#L467-L469
> which will interrupt all executor threads and then wait for them to finish 
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L95-L100
> However, this bolt executor thread ignores InterruptedException and continues 
> to wait for the first hook to finish. Hence there is a dead lock between the 
> first shutdown hook and the bolt executor thread.
> *Scenario 2*
> If the bolt executor thread invokes prepare method earlier than the main 
> thread registers these two shutdown hooks, because the bolt executor thread 
> already triggers shutdown, the main thread will receive an 
> IllegalStateException("Shutdown in progress")
> {code:java}
> 2020-06-18 11:57:29.159 o.a.s.u.Utils main [ERROR] Received error in thread 
> main.. terminating server...
> java.lang.Error: java.lang.IllegalStateException: Shutdown in progress
> {code}
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L1011-L1018
> https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L63-L67
> Since there is no shutdown hook registered, Runtime.getRuntime.exit invoked 
> by the bolt executor continues; in the meantime, the main thread also invokes 
> Runtime.getRuntime.exit because of IllegalStateException. Eventually the 
> process dies
> *Scenario 3*
> This is the worse case. In this scenario, after the first shutdown hook is 
> registered, and before the second hook is registered, the bolt executor 
> thread invokes prepare method and throws a RuntimeException. 
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L355-L356
>  
> In this case, we know have the deadlock between the first shutdown hook and 
> the bolt executor thread (like in scenario 1). The main thread will also have 
> IllegalStateException when it tries to register the second shutdown hook, so 
> main thread also invokes Runtime.getRuntime.exit (like in scenario 2). But 
> this time, since the bolt executor thread already obtained the shutdown lock 
> (because it invokes Runtime.getRuntime.exit earlier), 
> https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/Shutdown.java#L208-L214
> the main thread has to wait for the bolt executor thread to release this 
> Shutdown.class lock. However, there is a deadlock between bolt executor 
> thread and the first shutdown hook(thread), the main thread, the bolt 
> executor thread and the shutdown hook are all BLOCKED. 
> And in this case, since the second shutdown hook (sleepKill) is not 
> registered, and every other threads like heartbeat timers still work (not yet 
> closed), no one (not worker itself, not nimbus or supervisor) will kill this 
> worker. 
> So this worker stays alive but it does not function. And since this executor 
> bolt is blocked, it doesn't consume tuples from the receiveQ, so the receiveQ 
> can be quickly fill up by upstreams, the credential refresh thread is also 
> blocked because it won't give up until the credential tuple is published to 
> the receiveQ.
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L68-L69
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/JCQueue.java#L300-L313
> *How to produce the deadlock*
> It can be produced by modifying the 
> [WordCountTopology|https://github.com/apache/storm/blob/v2.1.0/examples/storm-starter/src/jvm/org/apache/storm/starter/WordCountTopology.java]
>  to add 
> {code:java}
> @Override
>     public void prepare(Map<String, Object> topoConf, TopologyContext 
> context) {
>         throw new RuntimeException("Runtime exception at WordCountBolt 
> prepare");
>     }
> {code}
> in the WordCount bolt so everytime the prepare method is called, this bolt 
> throws a RuntimeException. 
> Optionally, add a delay in between registering two shutdown hooks can help 
> reproduce scenario 3.
> {code:java}
> Runtime.getRuntime().addShutdownHook(wrappedFunc);
>         try {
>             Thread.sleep(100);
>         } catch (InterruptedException e) {
>             LOG.info("Sleep for 100ms between hooks", e);
>         }
>         LOG.info("Wait for 100ms");
>         Runtime.getRuntime().addShutdownHook(sleepKill);
> {code}
> *Solution*
> There are better ways to deal with this issue. But a simple way is to 
> register shutdown hooks before any executor threads are started to avoid 
> IllegalStateException; And in 
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L98-L101
> change it be a timed join so it won't wait indefinitely. This can prevent the 
> deadlock from happening. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to