[
https://issues.apache.org/jira/browse/STORM-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089380#comment-14089380
]
Robert Joseph Evans commented on STORM-256:
-------------------------------------------
I believe you that there is a bug in rebalance on large topologies; it is just
that Storm does not currently do a rebalance the way you have described it, and
I honestly don't know if it ever did. Could you please include the exact
version of Storm you were using when you saw this issue?
When a topology is rebalanced, Nimbus keeps the existing assignments the same
but marks the topology as needing to be rebalanced. Those assignments are then
filtered out of the data structures that are passed to the scheduler, so it
looks like none of the tasks for the rebalanced topology are currently running.
The scheduler then decides where to place the tasks for the topology without
taking into account where the tasks were running previously.
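
To make the filtering step concrete, here is a rough sketch in Python. It is
not Storm's actual code; the function name and the data layout (topology id
mapped to task placements) are made up for illustration only.

```python
# Hypothetical data model: topology id -> {task id: (node, port)}.
# This only illustrates the idea of hiding assignments from the scheduler;
# it is not how Nimbus actually represents them.

def assignments_visible_to_scheduler(existing, rebalancing):
    """Return the assignments the scheduler gets to see.

    Topologies marked for rebalance are filtered out, so the scheduler
    treats all of their tasks as unplaced. The stored assignments in
    `existing` are left untouched.
    """
    return {
        topo: slots
        for topo, slots in existing.items()
        if topo not in rebalancing
    }


existing = {
    "topo-a": {1: ("node1", 6700), 2: ("node2", 6700)},
    "topo-b": {1: ("node3", 6701)},
}

# "topo-a" is marked as needing a rebalance, so the scheduler sees only topo-b.
visible = assignments_visible_to_scheduler(existing, rebalancing={"topo-a"})
```

The point is that the old placement of "topo-a" is still stored, it is just
not shown to the scheduler, which therefore places its tasks from scratch.
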
The supervisor, for its part, constantly downloads all existing assignments
from ZooKeeper and diffs them against the assignments it currently knows
about. If they differ, it shoots processes that are no longer scheduled for it
to run and launches new tasks that it is not currently running.
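
The supervisor side can be sketched the same way. Again, this is only an
illustration in Python with hypothetical names and a simplified model (port
mapped to a worker description); the real supervisor logic is more involved.

```python
# Hypothetical model: port -> description of the worker that should run
# (or is running) on that port of this supervisor.

def sync_supervisor(assigned, running):
    """Diff the downloaded assignments against what is running locally.

    assigned: port -> worker description, as read from ZooKeeper
    running:  port -> worker description the supervisor currently runs
    Returns (to_kill, to_launch) as sorted lists of ports.
    """
    # Kill anything that is no longer assigned, or assigned differently.
    to_kill = sorted(p for p, w in running.items() if assigned.get(p) != w)
    # Launch anything assigned that is not already running as-is.
    to_launch = sorted(p for p, w in assigned.items() if running.get(p) != w)
    return to_kill, to_launch


running = {6700: "topo-a:old", 6701: "topo-b"}
assigned = {6700: "topo-a:new", 6702: "topo-c"}

to_kill, to_launch = sync_supervisor(assigned, running)
# Port 6700 appears in both lists: the old worker is shot, a new one launched.
```

Because the diff is always computed against the latest downloaded state, the
supervisor only ever needs the current assignment, not any intermediate one.
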
> storm rebalance bug caused supervisor to miss topology’s assignment
> -------------------------------------------------------------------
>
> Key: STORM-256
> URL: https://issues.apache.org/jira/browse/STORM-256
> Project: Apache Storm (Incubating)
> Issue Type: Bug
> Reporter: vinceyang
>
> In our 300+ node cluster, a rebalance occasionally (with low probability)
> causes a supervisor to miss a topology's assignment.
> The process is as follows:
> nimbus rebalance:
> 1. nimbus receives the rebalance command
> 2. nimbus changes the job status in zookeeper to "KILLED"
> 3. nimbus computes the new assignment and writes it to zookeeper
> 4. nimbus changes the job status to "ACTIVE"
> supervisor rebalance (the supervisor watches the topology assignment node in
> zookeeper):
> 1. when the topology's status changes to "KILLED", the supervisor receives
> the change and calls the mk-synchronize-supervisor function
> 2. mk-synchronize-supervisor tries to read the assignment from zookeeper
> (for simplicity, call it assignment-A), but before the read completes the
> topology's status has already changed to "ACTIVE" and the job's assignment
> has changed to assignment-B, so mk-synchronize-supervisor reads only
> assignment-B and misses assignment-A
> 3. because assignment-A was missed, the rebalance does not take effect on
> this supervisor, and the whole topology is not working