P. Taylor Goetz created STORM-2176:
--------------------------------------

             Summary: Workers do not shutdown cleanly and worker hooks don't 
run when a topology is killed
                 Key: STORM-2176
                 URL: https://issues.apache.org/jira/browse/STORM-2176
             Project: Apache Storm
          Issue Type: Bug
    Affects Versions: 1.0.0, 1.0.1, 1.0.2
            Reporter: P. Taylor Goetz
            Priority: Critical


This appears to have been introduced in the 1.0.0 release. The issues does not 
seem to affect 0.10.2.

When a topology is killed and workers receive the notification to shutdown, 
they do not shutdown cleanly, so worker hooks never get invoked.

When a worker shuts down cleanly, the worker logs should contain entries such 
as the following:

{code}
2016-10-28 18:52:06.273 b.s.d.worker [INFO] Shut down transfer thread
2016-10-28 18:52:06.279 b.s.d.worker [INFO] Shutting down default resources
2016-10-28 18:52:06.287 b.s.d.worker [INFO] Shut down default resources
2016-10-28 18:52:06.351 b.s.d.worker [INFO] Disconnecting from storm cluster 
state context
2016-10-28 18:52:06.359 b.s.d.worker [INFO] Shut down worker 
exclaim-1-1477680593 61bddd66-0fda-4556-b742-4b63f0df6fc1 6700
{code}

In the 1.0.x line of releases (and presumably 1.x, though I haven't checked) 
this does not happen -- the worker shutdown process appears to get stuck 
shutting down executors 
(https://github.com/apache/storm/blob/v1.0.2/storm-core/src/clj/org/apache/storm/daemon/worker.clj#L666),
 no further log messages are seen in the worker log, and worker hooks do not 
run.

There are two properties that affect how workers exit. The first is the 
configuration property {{supervisor.worker.shutdown.sleep.secs}}, which 
defaults to 1 second. This corresponds to how long the supervisor will wait for 
a worker to exit gracefully before forcibly killing it with {{kill -9}}. When 
this happens the supervisor will log that the worker terminated with exit code 
137 (128 + 9).

The second property is a hard-coded 1 second delay 
(https://github.com/apache/storm/blob/v1.0.2/storm-core/src/clj/org/apache/storm/util.clj#L463)
 added as a shutdown hook that will call {{Runtime.halt()}} if the delay is 
exceeded. When this happens, the supervisor will log that the worker terminated 
with exit code 20 (hard-coded).

Side Note: The hardcoded halt delay in worker.clj and the default value for 
{{supervisor.worker.shutdown.sleep.secs}} both being 1 second should probably 
be changed since it creates a race to see whether the supervisor delay or the 
worker delay wins.


To test this, I set {{supervisor.worker.shutdown.sleep.secs}} to 15 to allow 
plenty of time for the worker to exit gracefully, and deployed and killed a 
topology. In this case the supervisor consistently reported exit code 20 for 
the worker, indicating the hard-coded shutdown hook caused the worker to exit.

I thought the hard-coded 1 second shutdown hook delay might not be long enough 
for the worker to shutdown cleanly. To test that hypothesis, I changed the 
hard-code delay to 10 seconds, leaving 
{{supervisor.worker.shutdown.sleep.secs}} at 15 seconds. Again supervisor 
reported an exit code of 20 for the worker, and there were no log messages 
indicating the worker had exited cleanly and that the worker hook had run.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to