[ 
https://issues.apache.org/jira/browse/STORM-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15803954#comment-15803954
 ] 

Jungtaek Lim commented on STORM-2176:
-------------------------------------

I'm testing on multi-lang topology now, and found it also works well. 
Even NoOutputException is occurred after shutdown hook is started, other 
threads do their cleanup, and one of shutdown hook calls 
Runtime.getRuntime.halt(20) after 1 seconds to force shutting down JVM.

So I guess we don't have an issue except race condition for killing between 
supervisor and worker itself. But silently dropped logs make us confusing.

I'll submit a patch regarding adjusting supervisor sleep secs, and also 
disabling shutdown hook for Log4j2.

> Workers do not shutdown cleanly and worker hooks don't run when a topology is 
> killed
> ------------------------------------------------------------------------------------
>
>                 Key: STORM-2176
>                 URL: https://issues.apache.org/jira/browse/STORM-2176
>             Project: Apache Storm
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 1.0.1, 1.0.2
>            Reporter: P. Taylor Goetz
>            Priority: Critical
>
> This appears to have been introduced in the 1.0.0 release. The issues does 
> not seem to affect 0.10.2.
> When a topology is killed and workers receive the notification to shutdown, 
> they do not shutdown cleanly, so worker hooks never get invoked.
> When a worker shuts down cleanly, the worker logs should contain entries such 
> as the following:
> {code}
> 2016-10-28 18:52:06.273 b.s.d.worker [INFO] Shut down transfer thread
> 2016-10-28 18:52:06.279 b.s.d.worker [INFO] Shutting down default resources
> 2016-10-28 18:52:06.287 b.s.d.worker [INFO] Shut down default resources
> 2016-10-28 18:52:06.351 b.s.d.worker [INFO] Disconnecting from storm cluster 
> state context
> 2016-10-28 18:52:06.359 b.s.d.worker [INFO] Shut down worker 
> exclaim-1-1477680593 61bddd66-0fda-4556-b742-4b63f0df6fc1 6700
> {code}
> In the 1.0.x line of releases (and presumably 1.x, though I haven't checked) 
> this does not happen -- the worker shutdown process appears to get stuck 
> shutting down executors 
> (https://github.com/apache/storm/blob/v1.0.2/storm-core/src/clj/org/apache/storm/daemon/worker.clj#L666),
>  no further log messages are seen in the worker log, and worker hooks do not 
> run.
> There are two properties that affect how workers exit. The first is the 
> configuration property {{supervisor.worker.shutdown.sleep.secs}}, which 
> defaults to 1 second. This corresponds to how long the supervisor will wait 
> for a worker to exit gracefully before forcibly killing it with {{kill -9}}. 
> When this happens the supervisor will log that the worker terminated with 
> exit code 137 (128 + 9).
> The second property is a hard-coded 1 second delay 
> (https://github.com/apache/storm/blob/v1.0.2/storm-core/src/clj/org/apache/storm/util.clj#L463)
>  added as a shutdown hook that will call {{Runtime.halt()}} if the delay is 
> exceeded. When this happens, the supervisor will log that the worker 
> terminated with exit code 20 (hard-coded).
> Side Note: The hardcoded halt delay in worker.clj and the default value for 
> {{supervisor.worker.shutdown.sleep.secs}} both being 1 second should probably 
> be changed since it creates a race to see whether the supervisor delay or the 
> worker delay wins.
> To test this, I set {{supervisor.worker.shutdown.sleep.secs}} to 15 to allow 
> plenty of time for the worker to exit gracefully, and deployed and killed a 
> topology. In this case the supervisor consistently reported exit code 20 for 
> the worker, indicating the hard-coded shutdown hook caused the worker to exit.
> I thought the hard-coded 1 second shutdown hook delay might not be long 
> enough for the worker to shutdown cleanly. To test that hypothesis, I changed 
> the hard-code delay to 10 seconds, leaving 
> {{supervisor.worker.shutdown.sleep.secs}} at 15 seconds. Again supervisor 
> reported an exit code of 20 for the worker, and there were no log messages 
> indicating the worker had exited cleanly and that the worker hook had run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to