P. Taylor Goetz created STORM-2176:
--------------------------------------

Summary: Workers do not shutdown cleanly and worker hooks don't run when a topology is killed
Key: STORM-2176
URL: https://issues.apache.org/jira/browse/STORM-2176
Project: Apache Storm
Issue Type: Bug
Affects Versions: 1.0.0, 1.0.1, 1.0.2
Reporter: P. Taylor Goetz
Priority: Critical

This appears to have been introduced in the 1.0.0 release. The issue does not seem to affect 0.10.2.

When a topology is killed and workers receive the notification to shut down, they do not shut down cleanly, so worker hooks never get invoked.

When a worker shuts down cleanly, the worker logs should contain entries such as the following:

{code}
2016-10-28 18:52:06.273 b.s.d.worker [INFO] Shut down transfer thread
2016-10-28 18:52:06.279 b.s.d.worker [INFO] Shutting down default resources
2016-10-28 18:52:06.287 b.s.d.worker [INFO] Shut down default resources
2016-10-28 18:52:06.351 b.s.d.worker [INFO] Disconnecting from storm cluster state context
2016-10-28 18:52:06.359 b.s.d.worker [INFO] Shut down worker exclaim-1-1477680593 61bddd66-0fda-4556-b742-4b63f0df6fc1 6700
{code}

In the 1.0.x line of releases (and presumably 1.x, though I haven't checked) this does not happen: the worker shutdown process appears to get stuck shutting down executors (https://github.com/apache/storm/blob/v1.0.2/storm-core/src/clj/org/apache/storm/daemon/worker.clj#L666), no further log messages are seen in the worker log, and worker hooks do not run.

There are two properties that affect how workers exit.

The first is the configuration property {{supervisor.worker.shutdown.sleep.secs}}, which defaults to 1 second. This is how long the supervisor will wait for a worker to exit gracefully before forcibly killing it with {{kill -9}}. When this happens, the supervisor will log that the worker terminated with exit code 137 (128 + 9).

The second property is a hard-coded 1 second delay (https://github.com/apache/storm/blob/v1.0.2/storm-core/src/clj/org/apache/storm/util.clj#L463) added as a shutdown hook that will call {{Runtime.halt()}} if the delay is exceeded. When this happens, the supervisor will log that the worker terminated with exit code 20 (hard-coded).

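To make the second mechanism concrete, here is a minimal Java sketch of a shutdown hook that gives cleanup a fixed deadline and then forces the JVM down. It is not Storm's actual code (the linked implementation is Clojure and structured differently); the class name {{ForceHaltHook}}, the method names, and the injectable {{halter}} parameter are hypothetical and exist only so the timeout path can be exercised without killing the JVM. Only the exit code 20 and the use of {{Runtime.halt()}} come from the description above.

```java
import java.util.function.IntConsumer;

// Illustrative sketch only; not Storm's implementation.
final class ForceHaltHook {
    // Exit code the supervisor reports when the forced halt fires (from the issue text).
    static final int HALT_EXIT_CODE = 20;

    /**
     * Runs cleanup on a helper thread and waits up to delayMillis for it.
     * If cleanup is still running at the deadline, halter is called with
     * HALT_EXIT_CODE. Returns true only if cleanup finished in time.
     */
    static boolean runWithDeadline(Runnable cleanup, long delayMillis, IntConsumer halter) {
        if (delayMillis <= 0) {
            // Thread.join(0) would wait forever, so reject it explicitly.
            throw new IllegalArgumentException("delayMillis must be positive");
        }
        Thread t = new Thread(cleanup, "cleanup");
        t.setDaemon(true); // a stuck cleanup thread must not keep the JVM alive
        t.start();
        try {
            t.join(delayMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (t.isAlive()) {
            halter.accept(HALT_EXIT_CODE);
            return false;
        }
        return true;
    }

    /** Registers the deadline-guarded cleanup as a JVM shutdown hook. */
    static void install(Runnable cleanup, long delayMillis) {
        Runtime rt = Runtime.getRuntime();
        // Runtime.halt skips any remaining shutdown hooks and exits immediately.
        rt.addShutdownHook(new Thread(() -> runWithDeadline(cleanup, delayMillis, rt::halt)));
    }

    public static void main(String[] args) {
        // Hypothetical cleanup standing in for executor shutdown and worker hooks.
        install(() -> System.out.println("cleanup finished"), 1000);
    }
}
```

If the cleanup blocks past the deadline, as the executor shutdown appears to in this report, nothing after the blocking call runs and the process exits with code 20.
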
Side Note: The hard-coded halt delay in worker.clj and the default value for {{supervisor.worker.shutdown.sleep.secs}} are both 1 second. This should probably be changed, since it creates a race to see whether the supervisor delay or the worker delay wins.

To test this, I set {{supervisor.worker.shutdown.sleep.secs}} to 15 to allow plenty of time for the worker to exit gracefully, then deployed and killed a topology. In this case the supervisor consistently reported exit code 20 for the worker, indicating that the hard-coded shutdown hook caused the worker to exit.

I thought the hard-coded 1 second shutdown hook delay might not be long enough for the worker to shut down cleanly. To test that hypothesis, I changed the hard-coded delay to 10 seconds, leaving {{supervisor.worker.shutdown.sleep.secs}} at 15 seconds. Again the supervisor reported an exit code of 20 for the worker, and there were no log messages indicating that the worker had exited cleanly or that the worker hook had run.
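The interaction between the two delays can be summarized with a small model. The Java sketch below is a simplification, not Storm code: it assumes both timers start at the same instant, treats a stuck worker as needing infinite time to shut down, and uses the hypothetical names {{WorkerExitModel}} and {{predict}}. The exit codes 137 and 20 are the ones described in this report; the exit code for a clean shutdown is not stated, so the model returns -1 for it.

```java
// Simplified model of which mechanism ends the worker; not Storm code.
final class WorkerExitModel {
    static final int SIGKILL_EXIT_CODE = 128 + 9; // supervisor's kill -9
    static final int HALT_EXIT_CODE = 20;         // worker's own forced halt

    enum Outcome { CLEAN_EXIT, HALTED_BY_HOOK, KILLED_BY_SUPERVISOR, RACE }

    /**
     * cleanShutdownSecs: time the worker needs to shut down on its own
     * (Double.POSITIVE_INFINITY for a worker that is stuck).
     */
    static Outcome predict(double cleanShutdownSecs, double haltDelaySecs, double supervisorSleepSecs) {
        double deadline = Math.min(haltDelaySecs, supervisorSleepSecs);
        if (cleanShutdownSecs < deadline) {
            return Outcome.CLEAN_EXIT;
        }
        if (haltDelaySecs == supervisorSleepSecs) {
            return Outcome.RACE; // equal delays: either mechanism may fire first
        }
        return haltDelaySecs < supervisorSleepSecs
                ? Outcome.HALTED_BY_HOOK
                : Outcome.KILLED_BY_SUPERVISOR;
    }

    /** Exit code the supervisor would log, or -1 where the report does not specify one. */
    static int exitCode(Outcome outcome) {
        switch (outcome) {
            case HALTED_BY_HOOK:
                return HALT_EXIT_CODE;
            case KILLED_BY_SUPERVISOR:
                return SIGKILL_EXIT_CODE;
            default:
                return -1;
        }
    }

    public static void main(String[] args) {
        double stuck = Double.POSITIVE_INFINITY;
        System.out.println(predict(stuck, 1, 1));   // shipped defaults
        System.out.println(predict(stuck, 1, 15));  // first experiment
        System.out.println(predict(stuck, 10, 15)); // second experiment
    }
}
```

Under these assumptions, both experiments above are consistent with a worker that never finishes shutting down: raising either delay only changes which mechanism terminates it, not whether the worker hooks run.
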