[
https://issues.apache.org/jira/browse/STORM-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331525#comment-15331525
]
Nico Meyer commented on STORM-1879:
-----------------------------------
Seems like the problem is this:
{quote}2016-06-14 18:31:00.741 o.a.s.d.supervisor [INFO] Launching worker with
assignment {:storm-id "realtime_stats-4-1465921242", :executors [[85 85] [39
39] [128 128]], :resources #object[org.apache.storm.generated.WorkerResources
0x58f5aa7a "WorkerResources(mem_on_heap:0.0, mem_off_heap:0.0, cpu:0.0)"]} for
this supervisor ee56fb9d-2657-4d3f-b52a-1ae4abae85f7 on port 6702 with id
a2566676-8c17-4314-8eec-b4a5cdeb17c9
{quote}
and this
bq. 2016-06-14 18:31:04.464 o.a.s.d.supervisor [DEBUG] Worker
a2566676-8c17-4314-8eec-b4a5cdeb17c9 is :not-started: nil at supervisor
time-secs 1465921864
This last line appears right after the assignment from Nimbus changed. The
worker on port 6702 has different executors in the new assignment and should
therefore be killed. But the currently running worker on port 6702 with id
a2566676-8c17-4314-8eec-b4a5cdeb17c9 was launched only ~3.7s ago and has not
written any heartbeat files yet. Here
https://github.com/apache/storm/blob/v1.0.1/storm-core/src/clj/org/apache/storm/daemon/supervisor.clj#L536
the {{port->worker-id}} map is created from the worker heartbeats, so there
will be no entry for port 6702 and the supervisor cannot map that port back to
the worker it needs to shut down.
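To make the race concrete, here is a minimal sketch of that logic (an
illustration only, not the actual supervisor.clj code; {{workers-to-shutdown}},
{{heartbeats}} and {{changed-ports}} are placeholder names): a port whose
worker has not written a heartbeat yet simply drops out of the map.
{code}
;; Simplified illustration only -- not the real supervisor.clj code.
;; `heartbeats` is assumed to be a map of worker-id -> {:port ...} read from
;; the heartbeat files; `changed-ports` are ports whose new assignment differs
;; from what the running worker was launched with.
(defn workers-to-shutdown [heartbeats changed-ports]
  ;; port->worker-id only contains workers that have already written a heartbeat
  (let [port->worker-id (into {} (for [[worker-id hb] heartbeats]
                                   [(:port hb) worker-id]))]
    ;; a worker launched a few seconds ago (like the one on 6702) has no
    ;; heartbeat yet, so its port maps to nothing and it is never shut down
    (keep port->worker-id changed-ports)))
{code}
Under that assumption, any worker launched within the last few seconds is
invisible to the shutdown pass, which matches the 6702 case above.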
> Supervisor may not shut down workers cleanly
> --------------------------------------------
>
> Key: STORM-1879
> URL: https://issues.apache.org/jira/browse/STORM-1879
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-core
> Affects Versions: 1.0.1
> Reporter: Stig Rohde Døssing
> Attachments: nimbus-supervisor.zip, supervisor.log
>
>
> We've run into a strange issue with a zombie worker process. It looks like
> the worker pid file somehow got deleted without the worker process shutting
> down. This causes the supervisor to repeatedly, and unsuccessfully, try to
> kill the worker, and means multiple workers may be assigned to the same port.
> The worker root folder sticks around because the worker is still heartbeating
> to it.
> It may or may not be related that we've seen Nimbus occasionally enter an
> infinite loop of printing logs similar to the below.
> {code}
> 2016-05-19 14:55:14.196 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormconf.ser
> 2016-05-19 14:55:14.210 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormcode.ser
> 2016-05-19 14:55:14.218 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormconf.ser
> 2016-05-19 14:55:14.256 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormcode.ser
> 2016-05-19 14:55:14.273 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormcode.ser
> 2016-05-19 14:55:14.316 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormconf.ser
> {code}
> This continues until Nimbus is rebooted. We also see repeating blocks
> similar to the logs below.
> {code}
> 2016-06-02 07:45:03.656 o.a.s.d.nimbus [INFO] Cleaning up
> ZendeskTicketTopology-127-1464780171
> 2016-06-02 07:45:04.132 o.a.s.d.nimbus [INFO]
> ExceptionKeyNotFoundException(msg:ZendeskTicketTopology-127-1464780171-stormjar.jar)
> 2016-06-02 07:45:04.144 o.a.s.d.nimbus [INFO]
> ExceptionKeyNotFoundException(msg:ZendeskTicketTopology-127-1464780171-stormconf.ser)
> 2016-06-02 07:45:04.155 o.a.s.d.nimbus [INFO]
> ExceptionKeyNotFoundException(msg:ZendeskTicketTopology-127-1464780171-stormcode.ser)
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)