[jira] [Commented] (STORM-112) Race condition between Topology Kill and Worker Timeout can crash supervisor

Michael Noll (JIRA) Wed, 17 Dec 2014 08:50:35 -0800

    [ https://issues.apache.org/jira/browse/STORM-112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250102#comment-14250102 ]


Michael Noll commented on STORM-112:
------------------------------------

I think I can confirm that this still affects Storm as of 0.9.2.

It may also be triggered by killing a topology with a short kill wait time 
(say, 0-5 seconds) and then resubmitting the same topology immediately, or 
within a few seconds of killing the previously running instance.

> Race condition between Topology Kill and Worker Timeout can crash supervisor
> ----------------------------------------------------------------------------
>
>                 Key: STORM-112
>                 URL: https://issues.apache.org/jira/browse/STORM-112
>             Project: Apache Storm
>          Issue Type: Bug
>            Reporter: James Xu
>
> Recently during testing on a single node cluster we saw a supervisor crash 
> when a topology was killed. The supervisor came back up and recovered, so it 
> was not that big of a deal, but when we dug into it, it appears that there is 
> a race.
> https://github.com/nathanmarz/storm/issues/656
> When a topology is killed the local assignments are reset, and then 
> stormconf.ser is deleted right away. But at the same time sync-process may 
> already be running with old state indicating that a worker timed out and 
> needs to be relaunched. launch-worker then tries to read in the topology conf 
> which was deleted and crashes.
> The following is a sanitized version of the supervisor log that shows this 
> happening.
> https://gist.github.com/revans2/6282830
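
For illustration only, the sequence described in the report (stale state says
"relaunch the worker", the conf file is deleted, the relaunch then reads the
missing file) can be sketched in plain Java. This is not Storm's supervisor
code. The class and method names below are hypothetical, and treating a
missing stormconf.ser as "topology already killed, skip the relaunch" is an
assumed mitigation, not necessarily the fix adopted by the project.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical illustration; none of these names come from the Storm code base.
public class SupervisorRaceSketch {

    // Mirrors the failure mode: assumes the conf file still exists and lets
    // the exception propagate to the caller.
    public static byte[] readConfUnsafe(Path stormConf) throws IOException {
        return Files.readAllBytes(stormConf);
    }

    // Assumed mitigation: a missing conf is interpreted as "the topology was
    // killed in the meantime", so the caller can skip the relaunch.
    public static Optional<byte[]> readConfIfPresent(Path stormConf) throws IOException {
        try {
            return Optional.of(Files.readAllBytes(stormConf));
        } catch (NoSuchFileException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("stormdist");
        Path conf = dir.resolve("stormconf.ser");
        Files.write(conf, new byte[] {1, 2, 3});

        // Step 1: the sync pass captured state saying a worker timed out.
        boolean workerTimedOut = true;

        // Step 2: the kill path deletes the conf before the relaunch runs.
        Files.delete(conf);

        // Step 3: the relaunch acts on the stale state from step 1.
        if (workerTimedOut) {
            try {
                readConfUnsafe(conf);
            } catch (NoSuchFileException e) {
                System.out.println("unsafe read failed: conf already deleted");
            }
            boolean found = readConfIfPresent(conf).isPresent();
            System.out.println("safe read found conf: " + found);
        }
        Files.delete(dir);
    }
}
```

The steps run sequentially here so the interleaving is deterministic. In the
real supervisor the kill and the relaunch would run concurrently, which is
what makes the failure intermittent.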



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
