[jira] [Commented] (STORM-256) storm rebalance bug caused supervisor miss topology’s assignment

Robert Joseph Evans (JIRA) Thu, 07 Aug 2014 09:11:08 -0700

    [ https://issues.apache.org/jira/browse/STORM-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089380#comment-14089380 ]


Robert Joseph Evans commented on STORM-256:
-------------------------------------------

I believe you that there is a bug in rebalance on large topologies.  It is just 
that Storm does not currently do a rebalance the way you have described it, and 
I honestly don't know if it ever did.  Could you please include the exact 
version of Storm you were using when you saw this issue?

When a topology is rebalanced, nimbus keeps the existing assignments the same 
but marks the topology as needing to be rebalanced.  Those assignments are then 
filtered out of the data structures that are passed to the scheduler, so to the 
scheduler it looks like none of the tasks for the rebalanced topology are 
currently running.  The scheduler then decides where to place the tasks for the 
topology without taking into account where the tasks were running previously.
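
To make the filtering step above concrete, here is a minimal sketch in Python.  
It is not Storm's actual code; the function name, the dictionary layout and the 
topology names are all hypothetical and only illustrate the idea.

```python
def assignments_visible_to_scheduler(existing, needs_rebalance):
    """Return the assignments the scheduler is allowed to see.

    existing:        {topology_id: {task_id: "host:port"}}  (hypothetical layout)
    needs_rebalance: set of topology ids marked for rebalance
    """
    # Topologies marked for rebalance are hidden, so the scheduler treats
    # all of their tasks as unassigned and places them from scratch.
    return {
        topology: dict(slots)
        for topology, slots in existing.items()
        if topology not in needs_rebalance
    }


existing = {
    "topo-a": {"task-1": "node1:6700", "task-2": "node2:6700"},
    "topo-b": {"task-1": "node3:6700"},
}
visible = assignments_visible_to_scheduler(existing, {"topo-a"})
print(visible)
```

Note that `existing` itself is left untouched: the old assignments stay in 
place until the scheduler produces new ones.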

The supervisor, for its part, is constantly downloading all existing 
assignments from ZooKeeper and diffing them against the current assignments 
that it knows about.  If they differ, it shoots the processes that are no 
longer scheduled for it to run and launches the new tasks that it is not 
currently running.
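
The diff the supervisor performs can be sketched roughly as follows, again in 
Python.  This is only an illustration under assumed data shapes (a mapping from 
worker port to topology id); the names are hypothetical and Storm's own 
implementation differs.

```python
def diff_assignments(assigned, running):
    """Compare what should run on this supervisor with what is running.

    assigned: {port: topology_id} as read from ZooKeeper   (hypothetical)
    running:  {port: topology_id} for live local processes (hypothetical)
    Returns (to_kill, to_launch).
    """
    # A process is shot when its port is no longer assigned to its topology.
    to_kill = sorted(
        port for port, topo in running.items() if assigned.get(port) != topo
    )
    # A task is launched when it is assigned but not already running.
    to_launch = {
        port: topo for port, topo in assigned.items() if running.get(port) != topo
    }
    return to_kill, to_launch


running = {6700: "topo-a", 6701: "topo-b"}
assigned = {6701: "topo-b", 6702: "topo-a"}
to_kill, to_launch = diff_assignments(assigned, running)
print(to_kill, to_launch)
```

Because the comparison is against the latest assignments rather than against 
a sequence of change events, only the most recent state read from ZooKeeper 
matters in this sketch.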

> storm rebalance bug caused supervisor miss topology’s assignment
> ---------------------------------------------------------------
>
>                 Key: STORM-256
>                 URL: https://issues.apache.org/jira/browse/STORM-256
>             Project: Apache Storm (Incubating)
>          Issue Type: Bug
>            Reporter: vinceyang
>
> In our cluster of 300+ nodes, a rebalance occasionally (with low 
> probability) causes a supervisor to miss a topology's assignment.
> The process is as follows:
> Nimbus rebalance:
> 1. Receive the rebalance command.
> 2. Nimbus changes the job status in ZooKeeper to "KILLED".
> 3. Compute the new assignment and write it to ZooKeeper.
> 4. Change the job status to "ACTIVE".
> Supervisor rebalance (the supervisor watches the topology assignment node in 
> ZooKeeper):
> 1. When the topology's status changes to "KILLED", the supervisor receives 
> the change and calls the mk-synchronize-supervisor function.
> 2. mk-synchronize-supervisor tries to read the assignment from ZooKeeper 
> (for simplicity, call it assignment-A), but before the read completes the 
> topology's status has already changed to "ACTIVE" and the job's assignment 
> has changed to assignment-B, so mk-synchronize-supervisor reads only 
> assignment-B and misses assignment-A.
> 3. Because assignment-A was missed, the rebalance does not take effect on 
> this supervisor, and the whole topology is not working.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
