[
https://issues.apache.org/jira/browse/STORM-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15354803#comment-15354803
]
Jungtaek Lim commented on STORM-1879:
-------------------------------------
Sorry for joining late. I have been busy with other work.
I suspect that many of the supervisor issues come from a race condition
between sync-supervisor and sync-processes.
One of the supervisor tests fails intermittently
([STORM-1933|https://issues.apache.org/jira/browse/STORM-1933]), and after
digging I found that the supervisor has a race condition that can cause
various issues.
(What [~nico.meyer] pointed out seems to be the same as what STORM-1933 shows.)
I submitted a [patch|https://github.com/apache/storm/pull/1528] for
[STORM-1934|https://issues.apache.org/jira/browse/STORM-1934], so I'd be really
happy if you could apply my patch and see whether it works.
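
To illustrate the kind of race meant here, below is a minimal Python sketch. It is a toy model, not Storm's code and not the patch linked above; the class, the method names and the port numbers are all hypothetical. It assumes two periodic tasks that share worker state, and it shows one way to keep them from interleaving: both take the same lock, and the reconcile step removes a worker before it removes that worker's pid file.

```python
import threading


class SupervisorSketch:
    """Toy model of two periodic tasks sharing worker state (names hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()  # serializes the two sync tasks
        self.assigned = set()          # ports that should have a worker
        self.running = {}              # port -> pid of the live worker
        self.pid_files = {}            # port -> pid recorded "on disk"
        self._next_pid = 1000

    def sync_supervisor(self, new_assignment):
        """Record a new assignment (stand-in for sync-supervisor)."""
        with self._lock:
            self.assigned = set(new_assignment)

    def sync_processes(self):
        """Reconcile live workers with the assignment (stand-in for sync-processes)."""
        with self._lock:
            # Stop the worker first, then drop its pid file, so a pid file
            # never disappears while its worker is still alive.
            for port in list(self.running):
                if port not in self.assigned:
                    del self.running[port]
                    del self.pid_files[port]
            # Start workers for assigned ports that have none.
            for port in self.assigned:
                if port not in self.running:
                    self._next_pid += 1
                    self.running[port] = self._next_pid
                    self.pid_files[port] = self._next_pid

    def zombies(self):
        """Ports with a live worker but no pid file (the symptom in this issue)."""
        with self._lock:
            return sorted(p for p in self.running if p not in self.pid_files)


def run_demo(rounds=200):
    """Run both tasks concurrently, then reconcile once more and return the state."""
    sup = SupervisorSketch()

    def assigner():
        for i in range(rounds):
            sup.sync_supervisor({6700, 6701} if i % 2 else {6701, 6702})

    def reconciler():
        for _ in range(rounds):
            sup.sync_processes()

    threads = [threading.Thread(target=assigner),
               threading.Thread(target=reconciler)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    sup.sync_processes()  # final pass against the last assignment
    return sup
```

Calling `run_demo()` and then checking `zombies()` on the result should give an empty list, because no task can observe the state halfway through the other task's update. Without the shared lock, the same two loops could interleave between the two deletions and leave a worker running with no pid file.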
> Supervisor may not shut down workers cleanly
> --------------------------------------------
>
> Key: STORM-1879
> URL: https://issues.apache.org/jira/browse/STORM-1879
> Project: Apache Storm
> Issue Type: Bug
> Components: storm-core
> Affects Versions: 1.0.1
> Reporter: Stig Rohde Døssing
> Attachments: fix_missing_worker_pid.patch, nimbus-supervisor.zip,
> supervisor.log
>
>
> We've run into a strange issue with a zombie worker process. It looks like
> the worker pid file somehow got deleted without the worker process shutting
> down. This causes the supervisor to try repeatedly to kill the worker
> unsuccessfully, and means multiple workers may be assigned to the same port.
> The worker root folder sticks around because the worker is still heartbeating
> to it.
> It may or may not be related that we've seen Nimbus occasionally enter an
> infinite loop of printing logs similar to the below.
> {code}
> 2016-05-19 14:55:14.196 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormconf.ser
> 2016-05-19 14:55:14.210 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormcode.ser
> 2016-05-19 14:55:14.218 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormconf.ser
> 2016-05-19 14:55:14.256 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormcode.ser
> 2016-05-19 14:55:14.273 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormcode.ser
> 2016-05-19 14:55:14.316 o.a.s.b.BlobStoreUtils [ERROR] Could not update the
> blob with keyZendeskTicketTopology-5-1463647641-stormconf.ser
> {code}
> Which continues until Nimbus is rebooted. We also see repeating blocks
> similar to the logs below.
> {code}
> 2016-06-02 07:45:03.656 o.a.s.d.nimbus [INFO] Cleaning up
> ZendeskTicketTopology-127-1464780171
> 2016-06-02 07:45:04.132 o.a.s.d.nimbus [INFO]
> ExceptionKeyNotFoundException(msg:ZendeskTicketTopology-127-1464780171-stormjar.jar)
> 2016-06-02 07:45:04.144 o.a.s.d.nimbus [INFO]
> ExceptionKeyNotFoundException(msg:ZendeskTicketTopology-127-1464780171-stormconf.ser)
> 2016-06-02 07:45:04.155 o.a.s.d.nimbus [INFO]
> ExceptionKeyNotFoundException(msg:ZendeskTicketTopology-127-1464780171-stormcode.ser)
> {code}