[ 
https://issues.apache.org/jira/browse/STORM-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15344582#comment-15344582
 ] 

Nico Meyer commented on STORM-1879:
-----------------------------------

I am not sure about the problem Jeyendran Balakrishnan described, but I have 
already worked out what causes the other problem.
I described my findings in one of my recent comments. Maybe I didn't describe 
them very well, so let me try to summarize them differently:

The problem occurs if {{kill-existing-workers-with-change-in-components}} tries 
to kill a worker that has been started but has not yet written a single 
heartbeat. In that case {{read-allocated-workers}} returns a state of 
{{:not-started}} and no information about the port this worker is assigned to. 
As a result, the second argument to {{shutdown-worker}} in this line will be 
{{nil}}:
https://github.com/apache/storm/blob/v1.0.1/storm-core/src/clj/org/apache/storm/daemon/supervisor.clj#L536
This leads to:
# {{pids}} is empty 
(https://github.com/apache/storm/blob/v1.0.1/storm-core/src/clj/org/apache/storm/daemon/supervisor.clj#L277),
 so the worker is not killed.
# {{try-cleanup-worker}} is also called with an {{id}} of {{nil}} 
(https://github.com/apache/storm/blob/v1.0.1/storm-core/src/clj/org/apache/storm/daemon/supervisor.clj#L303).
# This deletes the local state files of all workers, because 
{{org.apache.storm.config/worker-root conf nil}} returns, for example, 
'/var/lib/storm/workers/', and this path is deleted 
(https://github.com/apache/storm/blob/v1.0.1/storm-core/src/clj/org/apache/storm/daemon/supervisor.clj#L265).
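
To make the failure mode above concrete, here is a minimal Python sketch. It 
is not the Storm code. The function names, the directory layout and the guard 
are all hypothetical, and it assumes that the path is built by plain string 
concatenation in which a missing id contributes an empty string. Under that 
assumption, cleaning up a worker with an unknown id removes the shared parent 
directory of every worker.

```python
import os
import shutil
import tempfile


def worker_root(local_dir, worker_id=None):
    # Hypothetical analogue of worker-root: plain string concatenation,
    # where a missing id contributes an empty string (an assumption about
    # how the nil id ends up producing '/var/lib/storm/workers/').
    return local_dir + "/workers/" + (worker_id or "")


def cleanup_unguarded(local_dir, worker_id):
    # Deletes whatever path worker_root yields, even for a missing id.
    shutil.rmtree(worker_root(local_dir, worker_id), ignore_errors=True)


def cleanup_guarded(local_dir, worker_id):
    # Hypothetical guard: refuse to clean up when the id is unknown.
    if not worker_id:
        return False
    shutil.rmtree(worker_root(local_dir, worker_id), ignore_errors=True)
    return True


def make_workers(local_dir, ids):
    # Illustrative layout only: one directory with a dummy pid file per id.
    for wid in ids:
        pid_dir = os.path.join(worker_root(local_dir, wid), "pids")
        os.makedirs(pid_dir)
        open(os.path.join(pid_dir, "12345"), "w").close()
```

Calling {{make_workers}} on a temporary directory and then 
{{cleanup_unguarded(dir, None)}} removes the state of every worker, while 
{{cleanup_guarded(dir, None)}} leaves the directories untouched and returns 
{{False}}.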

The heartbeat files of the workers that are still running will be recreated 
at their next heartbeat, but the PID files are gone for good at that point, 
so even stopping the topology won't help. Only manually killing the worker 
processes will solve the problem.

> Supervisor may not shut down workers cleanly
> --------------------------------------------
>
>                 Key: STORM-1879
>                 URL: https://issues.apache.org/jira/browse/STORM-1879
>             Project: Apache Storm
>          Issue Type: Bug
>          Components: storm-core
>    Affects Versions: 1.0.1
>            Reporter: Stig Rohde Døssing
>         Attachments: fix_missing_worker_pid.patch, nimbus-supervisor.zip, 
> supervisor.log
>
>
> We've run into a strange issue with a zombie worker process. It looks like 
> the worker pid file somehow got deleted without the worker process shutting 
> down. This causes the supervisor to try repeatedly to kill the worker 
> unsuccessfully, and means multiple workers may be assigned to the same port. 
> The worker root folder sticks around because the worker is still heartbeating 
> to it.
> It may or may not be related that we've seen Nimbus occasionally enter an 
> infinite loop of printing logs similar to the below.
> {code}
> 2016-05-19 14:55:14.196 o.a.s.b.BlobStoreUtils [ERROR] Could not update the 
> blob with keyZendeskTicketTopology-5-1463647641-stormconf.ser
> 2016-05-19 14:55:14.210 o.a.s.b.BlobStoreUtils [ERROR] Could not update the 
> blob with keyZendeskTicketTopology-5-1463647641-stormcode.ser
> 2016-05-19 14:55:14.218 o.a.s.b.BlobStoreUtils [ERROR] Could not update the 
> blob with keyZendeskTicketTopology-5-1463647641-stormconf.ser
> 2016-05-19 14:55:14.256 o.a.s.b.BlobStoreUtils [ERROR] Could not update the 
> blob with keyZendeskTicketTopology-5-1463647641-stormcode.ser
> 2016-05-19 14:55:14.273 o.a.s.b.BlobStoreUtils [ERROR] Could not update the 
> blob with keyZendeskTicketTopology-5-1463647641-stormcode.ser
> 2016-05-19 14:55:14.316 o.a.s.b.BlobStoreUtils [ERROR] Could not update the 
> blob with keyZendeskTicketTopology-5-1463647641-stormconf.ser
> {code}
> Which continues until Nimbus is rebooted. We also see repeating blocks 
> similar to the logs below.
> {code}
> 2016-06-02 07:45:03.656 o.a.s.d.nimbus [INFO] Cleaning up 
> ZendeskTicketTopology-127-1464780171
> 2016-06-02 07:45:04.132 o.a.s.d.nimbus [INFO] 
> ExceptionKeyNotFoundException(msg:ZendeskTicketTopology-127-1464780171-stormjar.jar)
> 2016-06-02 07:45:04.144 o.a.s.d.nimbus [INFO] 
> ExceptionKeyNotFoundException(msg:ZendeskTicketTopology-127-1464780171-stormconf.ser)
> 2016-06-02 07:45:04.155 o.a.s.d.nimbus [INFO] 
> ExceptionKeyNotFoundException(msg:ZendeskTicketTopology-127-1464780171-stormcode.ser)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
