P. Taylor Goetz created STORM-2176:
--------------------------------------
Summary: Workers do not shutdown cleanly and worker hooks don't
run when a topology is killed
Key: STORM-2176
URL: https://issues.apache.org/jira/browse/STORM-2176
Project: Apache Storm
Issue Type: Bug
Affects Versions: 1.0.0, 1.0.1, 1.0.2
Reporter: P. Taylor Goetz
Priority: Critical
This appears to have been introduced in the 1.0.0 release. The issue does not
seem to affect 0.10.2.
When a topology is killed and workers receive the notification to shut down,
they do not shut down cleanly, so worker hooks never get invoked.
When a worker shuts down cleanly, the worker logs should contain entries such
as the following:
{code}
2016-10-28 18:52:06.273 b.s.d.worker [INFO] Shut down transfer thread
2016-10-28 18:52:06.279 b.s.d.worker [INFO] Shutting down default resources
2016-10-28 18:52:06.287 b.s.d.worker [INFO] Shut down default resources
2016-10-28 18:52:06.351 b.s.d.worker [INFO] Disconnecting from storm cluster
state context
2016-10-28 18:52:06.359 b.s.d.worker [INFO] Shut down worker
exclaim-1-1477680593 61bddd66-0fda-4556-b742-4b63f0df6fc1 6700
{code}
In the 1.0.x line of releases (and presumably 1.x, though I haven't checked)
this does not happen -- the worker shutdown process appears to get stuck
shutting down executors
(https://github.com/apache/storm/blob/v1.0.2/storm-core/src/clj/org/apache/storm/daemon/worker.clj#L666),
no further log messages are seen in the worker log, and worker hooks do not
run.
There are two properties that affect how workers exit. The first is the
configuration property {{supervisor.worker.shutdown.sleep.secs}}, which
defaults to 1 second. This corresponds to how long the supervisor will wait for
a worker to exit gracefully before forcibly killing it with {{kill -9}}. When
this happens the supervisor will log that the worker terminated with exit code
137 (128 + 9).
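The 128 + N arithmetic above can be sketched as a small decoder. This is only an illustration of the common shell convention for signal-terminated processes; the class and method names ({{ExitCodes}}, {{signalFromExitCode}}) are hypothetical and are not part of Storm.

```java
import java.util.OptionalInt;

// Hypothetical helper, not Storm code: decodes the "128 + signal" exit code convention.
public final class ExitCodes {
    private ExitCodes() {}

    /** Returns the signal number if the exit code falls in the 129..255 range. */
    public static OptionalInt signalFromExitCode(int exitCode) {
        // By convention, an exit code of 128 + N means "terminated by signal N".
        if (exitCode > 128 && exitCode <= 255) {
            return OptionalInt.of(exitCode - 128);
        }
        return OptionalInt.empty();
    }

    public static void main(String[] args) {
        System.out.println(signalFromExitCode(137)); // OptionalInt[9], i.e. SIGKILL
        System.out.println(signalFromExitCode(0));   // OptionalInt.empty
    }
}
```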
The second property is a hard-coded 1 second delay
(https://github.com/apache/storm/blob/v1.0.2/storm-core/src/clj/org/apache/storm/util.clj#L463)
added as a shutdown hook that will call {{Runtime.halt()}} if the delay is
exceeded. When this happens, the supervisor will log that the worker terminated
with exit code 20 (hard-coded).
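For readers unfamiliar with this pattern, here is a minimal Java sketch of a shutdown hook that halts the JVM only when cleanup exceeds a deadline. It is not the Clojure code in Storm and may differ from it in structure; {{ForceHaltHook}}, {{runWithDeadline}}, {{install}} and the injectable {{haltAction}} are assumptions made for illustration. Only {{Runtime.addShutdownHook}} and {{Runtime.halt}} are real JDK calls.

```java
import java.util.function.IntConsumer;

// Hypothetical sketch, not Storm code: run cleanup, halt if it misses the deadline.
public final class ForceHaltHook {
    private ForceHaltHook() {}

    /** Runs cleanup on its own thread and calls haltAction if it is still running after delayMs. */
    public static void runWithDeadline(Runnable cleanup, long delayMs,
                                       int exitCode, IntConsumer haltAction) {
        if (delayMs <= 0) {
            // Thread.join(0) would wait forever, so reject non-positive delays.
            throw new IllegalArgumentException("delayMs must be positive");
        }
        Thread worker = new Thread(cleanup, "cleanup");
        worker.setDaemon(true);
        worker.start();
        try {
            worker.join(delayMs); // wait at most delayMs for a clean shutdown
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (worker.isAlive()) {
            haltAction.accept(exitCode); // deadline exceeded: force termination
        }
    }

    /** Registers the deadline-guarded cleanup as a JVM shutdown hook. */
    public static void install(Runnable cleanup, long delayMs, int exitCode) {
        Runtime.getRuntime().addShutdownHook(new Thread(
            () -> runWithDeadline(cleanup, delayMs, exitCode,
                                  code -> Runtime.getRuntime().halt(code)),
            "force-halt-hook"));
    }
}
```

With this sketch, {{ForceHaltHook.install(cleanup, 1000, 20)}} would mirror the 1 second delay and exit code 20 described above. The halt action is injectable only so the timing logic can be exercised without terminating the JVM.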
Side Note: The hard-coded halt delay in worker.clj and the default value for
{{supervisor.worker.shutdown.sleep.secs}} are both 1 second. That should
probably be changed, since it creates a race between the supervisor delay and
the worker delay.
To test this, I set {{supervisor.worker.shutdown.sleep.secs}} to 15 to allow
plenty of time for the worker to exit gracefully, and deployed and killed a
topology. In this case the supervisor consistently reported exit code 20 for
the worker, indicating the hard-coded shutdown hook caused the worker to exit.
I thought the hard-coded 1 second shutdown hook delay might not be long enough
for the worker to shut down cleanly. To test that hypothesis, I changed the
hard-coded delay to 10 seconds, leaving
{{supervisor.worker.shutdown.sleep.secs}} at 15 seconds. Again the supervisor
reported an exit code of 20 for the worker, and there were no log messages
indicating the worker had exited cleanly and that the worker hook had run.
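The interaction between the two delays can be summarized with a simplified timing model. This is a sketch under a stated assumption, namely that the supervisor's timer and the worker's halt timer start at the same instant, which a real cluster does not guarantee. The {{ShutdownRace}} class, its {{predict}} method and the {{STUCK}} constant are hypothetical names used only for illustration.

```java
// Hypothetical model, not Storm code: predicts which exit path wins for given delays.
public final class ShutdownRace {
    private ShutdownRace() {}

    public enum Outcome { CLEAN_EXIT, WORKER_HALT_EXIT_20, SUPERVISOR_KILL_EXIT_137, RACE }

    /** Marker for a worker whose clean shutdown never completes. */
    public static final long STUCK = Long.MAX_VALUE;

    /**
     * Assumes both timers start together. All arguments are in milliseconds.
     */
    public static Outcome predict(long cleanShutdownMs, long workerHaltMs, long supervisorKillMs) {
        long deadline = Math.min(workerHaltMs, supervisorKillMs);
        if (cleanShutdownMs < deadline) {
            return Outcome.CLEAN_EXIT; // cleanup finished before either timer fired
        }
        if (workerHaltMs == supervisorKillMs) {
            return Outcome.RACE; // equal delays: either side may win
        }
        return workerHaltMs < supervisorKillMs
                ? Outcome.WORKER_HALT_EXIT_20
                : Outcome.SUPERVISOR_KILL_EXIT_137;
    }
}
```

Under this model, {{predict(STUCK, 1000, 15000)}} and {{predict(STUCK, 10000, 15000)}} both yield {{WORKER_HALT_EXIT_20}}, which is consistent with the exit code 20 seen in both tests above, while {{predict(STUCK, 1000, 1000)}} yields {{RACE}}, corresponding to the default settings.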