[
https://issues.apache.org/jira/browse/STORM-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250102#comment-14250102
]
Michael Noll commented on STORM-112:
------------------------------------
I think I can confirm this is still affecting Storm as of 0.9.2.
It may also be caused by killing a topology with a small kill wait time (say,
0-5 seconds), followed by resubmitting the same topology immediately or a few
seconds after killing the previous running instance.
> Race condition between Topology Kill and Worker Timeout can crash supervisor
> ----------------------------------------------------------------------------
>
> Key: STORM-112
> URL: https://issues.apache.org/jira/browse/STORM-112
> Project: Apache Storm
> Issue Type: Bug
> Reporter: James Xu
>
> Recently, during testing on a single-node cluster, we saw a supervisor crash
> when a topology was killed. The supervisor came back up and recovered, so it
> was not that big of a deal, but when we dug into it, we found what appears to
> be a race.
> https://github.com/nathanmarz/storm/issues/656
> When a topology is killed, the local assignments are reset, and then
> stormconf.ser is deleted right away. But at the same time, sync-process may
> already be running with old state indicating that a worker timed out and
> needs to be relaunched. launch-worker then tries to read the topology conf,
> which has already been deleted, and crashes.
> The following is a sanitized version of the supervisor log that shows this
> happening.
> https://gist.github.com/revans2/6282830
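
The sequence described in the quoted report can be sketched roughly as follows.
This is only an illustration in Python, not Storm's supervisor code (which is
Clojure) and not necessarily the fix that was applied. The function names
read_topology_conf and launch_worker_safely are hypothetical, and JSON stands
in for the real serialized format of stormconf.ser. It shows one defensive
option: treat a missing conf file as "the topology was killed, skip the
relaunch" instead of letting the error take the process down.

```python
import json
import os
import tempfile


def read_topology_conf(conf_path):
    """Return the stored conf, or None if the file was deleted by a kill.

    Hypothetical helper: JSON is a stand-in for the real stormconf.ser format.
    """
    try:
        with open(conf_path, "r") as f:
            return json.load(f)
    except FileNotFoundError:
        return None


def launch_worker_safely(conf_path):
    """Act on a (possibly stale) relaunch decision without crashing."""
    conf = read_topology_conf(conf_path)
    if conf is None:
        # The topology was killed after the relaunch decision was made.
        return "skipped"
    return "launched"


def demo():
    """Replay the interleaving from the report in a deterministic order."""
    with tempfile.TemporaryDirectory() as workdir:
        conf_path = os.path.join(workdir, "stormconf.ser")
        with open(conf_path, "w") as f:
            json.dump({"topology.name": "demo"}, f)

        # Normal case: the conf is still present when the worker is launched.
        before_kill = launch_worker_safely(conf_path)

        # Topology kill: the conf file is removed right away.
        os.remove(conf_path)

        # Stale state still says "relaunch"; the read now finds no file.
        after_kill = launch_worker_safely(conf_path)

    return before_kill, after_kill
```

Without the FileNotFoundError handling, the second call would raise, which
corresponds to the crash seen in the supervisor log.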