[
https://issues.apache.org/jira/browse/STORM-3658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ethan Li updated STORM-3658:
----------------------------
Summary: Problematic worker stays alive because of a deadlock caused by
ShutdownHooks (was: Deadlock caused by ShutdownHooks)
> Problematic worker stays alive because of a deadlock caused by ShutdownHooks
> ----------------------------------------------------------------------------
>
> Key: STORM-3658
> URL: https://issues.apache.org/jira/browse/STORM-3658
> Project: Apache Storm
> Issue Type: Bug
> Reporter: Ethan Li
> Assignee: Ethan Li
> Priority: Major
>
> During worker startup, it starts many threads including executor threads, and
> then registers two shutdown hooks, first hook will invoke worker::shutdown,
> the second hook will sleep for 3 seconds before force halting the whole
> process.
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/daemon/worker/Worker.java#L141-L147
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/Executor.java#L253-L255
> We have seen a case where a deadlock happened. Imaging there is a bolt in a
> topology consistently fails at prepare stage and throws RuntimeException.
> This thread will eventually invoke Runtime.getRuntime().exit. The code path
> is:
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L406-L407
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/error/ReportErrorAndDie.java#L41
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L510-L514
> There are three scenarios here.
> *Scenario 1*
> If two shutdown hooks are registered before this bolt's prepare method is
> invoked, when this bolt throws RuntimeException, it eventually invokes
> Runtime.getRuntime().exit, which triggers two shutdown hooks to start. And
> then this bolt executor thread waits for these two shutdown hooks to finish.
> https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L104-L111
> (we use this openJDK8u242 version)
> However, what the first shutdown hooks does is to invoke worker::shutdown
> method
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/daemon/worker/Worker.java#L467-L469
> which will interrupt all executor threads and then wait for them to finish
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L95-L100
> However, this bolt executor thread ignores InterruptedException and continues
> to wait for the first hook to finish. Hence there is a dead lock between the
> first shutdown hook and the bolt executor thread.
> *Scenario 2*
> If the bolt executor thread invokes prepare method earlier than the main
> thread registers these two shutdown hooks, because the bolt executor thread
> already triggers shutdown, the main thread will receive an
> IllegalStateException("Shutdown in progress")
> {code:java}
> 2020-06-18 11:57:29.159 o.a.s.u.Utils main [ERROR] Received error in thread
> main.. terminating server...
> java.lang.Error: java.lang.IllegalStateException: Shutdown in progress
> {code}
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L1011-L1018
> https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/ApplicationShutdownHooks.java#L63-L67
> Since there is no shutdown hook registered, Runtime.getRuntime.exit invoked
> by the bolt executor continues; in the meantime, the main thread also invokes
> Runtime.getRuntime.exit because of IllegalStateException. Eventually the
> process dies
> *Scenario 3*
> This is the worse case. In this scenario, after the first shutdown hook is
> registered, and before the second hook is registered, the bolt executor
> thread invokes prepare method and throws a RuntimeException.
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/Utils.java#L355-L356
>
> In this case, we know have the deadlock between the first shutdown hook and
> the bolt executor thread (like in scenario 1). The main thread will also have
> IllegalStateException when it tries to register the second shutdown hook, so
> main thread also invokes Runtime.getRuntime.exit (like in scenario 2). But
> this time, since the bolt executor thread already obtained the shutdown lock
> (because it invokes Runtime.getRuntime.exit earlier),
> https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/jdk8u242-b08/jdk/src/share/classes/java/lang/Shutdown.java#L208-L214
> the main thread has to wait for the bolt executor thread to release this
> Shutdown.class lock. However, there is a deadlock between bolt executor
> thread and the first shutdown hook(thread), the main thread, the bolt
> executor thread and the shutdown hook are all BLOCKED.
> And in this case, since the second shutdown hook (sleepKill) is not
> registered, and every other threads like heartbeat timers still work (not yet
> closed), no one (not worker itself, not nimbus or supervisor) will kill this
> worker.
> So this worker stays alive but it does not function. And since this executor
> bolt is blocked, it doesn't consume tuples from the receiveQ, so the receiveQ
> can be quickly fill up by upstreams, the credential refresh thread is also
> blocked because it won't give up until the credential tuple is published to
> the receiveQ.
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L68-L69
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/utils/JCQueue.java#L300-L313
> *How to produce the deadlock*
> It can be produced by modifying the
> [WordCountTopology|https://github.com/apache/storm/blob/v2.1.0/examples/storm-starter/src/jvm/org/apache/storm/starter/WordCountTopology.java]
> to add
> {code:java}
> @Override
> public void prepare(Map<String, Object> topoConf, TopologyContext
> context) {
> throw new RuntimeException("Runtime exception at WordCountBolt
> prepare");
> }
> {code}
> in the WordCount bolt so everytime the prepare method is called, this bolt
> throws a RuntimeException.
> Optionally, add a delay in between registering two shutdown hooks can help
> reproduce scenario 3.
> {code:java}
> Runtime.getRuntime().addShutdownHook(wrappedFunc);
> try {
> Thread.sleep(100);
> } catch (InterruptedException e) {
> LOG.info("Sleep for 100ms between hooks", e);
> }
> LOG.info("Wait for 100ms");
> Runtime.getRuntime().addShutdownHook(sleepKill);
> {code}
> *Solution*
> There are better ways to deal with this issue. But a simple way is to
> register shutdown hooks before any executor threads are started to avoid
> IllegalStateException; And in
> https://github.com/apache/storm/blob/v2.1.0/storm-client/src/jvm/org/apache/storm/executor/ExecutorShutdown.java#L98-L101
> change it be a timed join so it won't wait indefinitely. This can prevent the
> deadlock from happening.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)