[ https://issues.apache.org/jira/browse/STORM-397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089371#comment-14089371 ]

Robert Joseph Evans commented on STORM-397:
-------------------------------------------

We have not seen this issue so far.  How reproducible is it?  Does it happen 
every time the supervisor dies?  Are you using the default (even) scheduler or 
the isolation scheduler?  If it is the isolation scheduler, is either of the 
topologies isolated?  I assume not, because you indicated that multiple 
topologies were running on the same supervisor.
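
For reference, isolation is normally enabled cluster-wide in storm.yaml, roughly 
as in the sketch below.  This is only a sketch of the 0.9.x settings; the 
topology name "Sync" (taken from your logs) and the machine count are 
placeholders, not your actual configuration:

{noformat}
# storm.yaml on nimbus -- sketch only, values are placeholders
storm.scheduler: "backtype.storm.scheduler.IsolationScheduler"
isolation.scheduler.machines:
    "Sync": 1    # topology name -> number of machines dedicated to it
{noformat}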

Is it possible for you to share the logs, including the logs for nimbus, the 
supervisors, and the workers?  Or could you share a stripped-down version of 
your topologies that we can use to reproduce the issue?

In general, Storm does not reassign workers when a supervisor goes down, because 
the supervisor is supposed to be running under supervision and be restarted 
quickly.  Nimbus assumes that any failure there is transient.  However, if a 
worker fails to heartbeat to nimbus, through ZooKeeper, for a default timeout of 
30 seconds, nimbus will assume that the worker has died and will mark it to be 
rescheduled.  If the supervisor is down at the time this rescheduling happens, 
nimbus will reschedule the worker on a different node that has a free slot.  If 
the supervisor is still up when the rescheduling happens, or manages to restart 
in time for nimbus to see it up, there is a possibility that the scheduler will 
place the worker in the same slot it was in before.  There are also lots of 
corner cases when you have a cluster that is nearly full and enough nodes go 
down that there are no longer enough slots to run everything.
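
For what it is worth, the relevant knobs live in storm.yaml.  The values below 
are the stock defaults (assuming you have not overridden them); the 10 second 
loop you see in your logs matches the monitor frequency:

{noformat}
# storm.yaml -- stock defaults, shown only for reference
nimbus.monitor.freq.secs: 10         # how often nimbus re-evaluates and sets assignments
nimbus.task.timeout.secs: 30         # executor heartbeat timeout before nimbus reschedules it
nimbus.supervisor.timeout.secs: 60   # supervisor heartbeat timeout before its slots are treated as gone
{noformat}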

Just looking at the log lines provided, the only thing that seems to change 
between the assignments is the start time, i.e. when they were reassigned, and 
two assignments switch places in their order.  Looking at the code, the change 
in assignment ordering looks like it might be enough to trigger the logs being 
printed out, but I don't really know for sure.  In either case, both schedulers 
really only reschedule workers when they think the workers are not up and 
functioning.

> Nimbus does not reassign a topology when the supervisor dies
> ------------------------------------------------------------
>
>                 Key: STORM-397
>                 URL: https://issues.apache.org/jira/browse/STORM-397
>             Project: Apache Storm (Incubating)
>          Issue Type: Bug
>    Affects Versions: 0.9.2-incubating
>         Environment: 2 topologies, 3 supervisors
>            Reporter: Simon Cooper
>            Priority: Critical
>
> We're running two topologies on a cluster with 3 supervisors. By default, 
> both topologies are assigned to the same supervisor. If that supervisor 
> dies, storm reassigns one topology to another supervisor but not the other, 
> leaving the second topology inactive.
> There are various symptoms/possible causes of this problem. In the nimbus 
> logs, from when the topologies are initially submitted, nimbus is continually 
> trying to reassign the second topology to the same supervisor every 10 
> seconds:
> {noformat}
> 2014-07-09 14:17:11 -: b.s.d.nimbus [INFO] Setting new assignment for 
> topology id Sync-1-1404911509: 
> #backtype.storm.daemon.common.Assignment{:master-code-dir 
> "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host 
> {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port 
> {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs 
> {[1 2] 1404911831, [7 8] 1404911831, [3 4] 1404911831, [12 12] 1404911831, [9 
> 10] 1404911831, [5 5] 1404911831, [11 11] 1404911831, [6 6] 1404911831}}
> 2014-07-09 14:17:21 -: b.s.d.nimbus [INFO] Setting new assignment for 
> topology id Sync-1-1404911509: 
> #backtype.storm.daemon.common.Assignment{:master-code-dir 
> "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host 
> {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port 
> {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs 
> {[1 2] 1404911841, [7 8] 1404911841, [3 4] 1404911841, [12 12] 1404911841, [9 
> 10] 1404911841, [5 5] 1404911841, [11 11] 1404911841, [6 6] 1404911841}}
> 2014-07-09 14:17:32 -: b.s.d.nimbus [INFO] Setting new assignment for 
> topology id Sync-1-1404911509: 
> #backtype.storm.daemon.common.Assignment{:master-code-dir 
> "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host 
> {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port 
> {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs 
> {[1 2] 1404911852, [7 8] 1404911852, [3 4] 1404911852, [12 12] 1404911852, [9 
> 10] 1404911852, [5 5] 1404911852, [11 11] 1404911852, [6 6] 1404911852}}
> {noformat}
> These log messages continue after the supervisor it's running on dies - 
> nimbus continually tries to reassign to a dead supervisor. Note that the 
> other topology is reassigned elsewhere without problems.
> If the broken topology is rebalanced, only then does nimbus assign the 
> topology to a working supervisor.
> Another symptom of this is that, when the machines running storm are started, 
> only one topology is running on startup. The second topology is not assigned 
> to a supervisor. Again, it takes a rebalance for nimbus to actually assign 
> the topology somewhere.
> A couple of possibly related bugs are STORM-256 and STORM-341, but I don't 
> understand those bugs well enough to be able to link them to these problems.
> This is a major issue for us. One of the reasons for using storm is that if a 
> supervisor were to die, storm would automatically fail over to another 
> supervisor. This does not happen, leaving our cluster with a single point of 
> failure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
