[
https://issues.apache.org/jira/browse/STORM-397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103946#comment-14103946
]
Robert Joseph Evans commented on STORM-397:
-------------------------------------------
[~thecoop1984] Sorry it took me so long to see your comment about the fix
working. I turned the diff into a formal pull request against STORM-341.
Because it fixed both issues, I will mark this one as a duplicate of STORM-341
and raise that issue's priority to Critical to match the priority here. If you
feel this is in error, please let me know.
> Nimbus does not reassign a topology when the supervisor dies
> ------------------------------------------------------------
>
> Key: STORM-397
> URL: https://issues.apache.org/jira/browse/STORM-397
> Project: Apache Storm (Incubating)
> Issue Type: Bug
> Affects Versions: 0.9.2-incubating
> Environment: 2 topologies, 3 supervisors
> Reporter: Simon Cooper
> Priority: Critical
> Attachments: nimbus.log, storm.png
>
>
> We're running two topologies on a cluster with 3 supervisors. By default,
> both topologies are assigned to the same supervisor. If that supervisor
> dies, Storm reassigns one topology to another supervisor but not the other,
> leaving the second topology inactive.
> There are various symptoms/possible causes of this problem. The nimbus
> logs show that, from the moment the topologies are first submitted, nimbus
> sets a new assignment for the second topology on the same supervisor roughly
> every 10 seconds:
> {noformat}
> 2014-07-09 14:17:11 -: b.s.d.nimbus [INFO] Setting new assignment for
> topology id Sync-1-1404911509:
> #backtype.storm.daemon.common.Assignment{:master-code-dir
> "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host
> {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port
> {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs
> {[1 2] 1404911831, [7 8] 1404911831, [3 4] 1404911831, [12 12] 1404911831, [9
> 10] 1404911831, [5 5] 1404911831, [11 11] 1404911831, [6 6] 1404911831}}
> 2014-07-09 14:17:21 -: b.s.d.nimbus [INFO] Setting new assignment for
> topology id Sync-1-1404911509:
> #backtype.storm.daemon.common.Assignment{:master-code-dir
> "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host
> {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port
> {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs
> {[1 2] 1404911841, [7 8] 1404911841, [3 4] 1404911841, [12 12] 1404911841, [9
> 10] 1404911841, [5 5] 1404911841, [11 11] 1404911841, [6 6] 1404911841}}
> 2014-07-09 14:17:32 -: b.s.d.nimbus [INFO] Setting new assignment for
> topology id Sync-1-1404911509:
> #backtype.storm.daemon.common.Assignment{:master-code-dir
> "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host
> {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port
> {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs
> {[1 2] 1404911852, [7 8] 1404911852, [3 4] 1404911852, [12 12] 1404911852, [9
> 10] 1404911852, [5 5] 1404911852, [11 11] 1404911852, [6 6] 1404911852}}
> {noformat}
> These log messages continue after the supervisor the topology is running on
> dies: nimbus keeps trying to assign it to a dead supervisor. Note that the
> other topology is reassigned elsewhere without problems.
> If the broken topology is rebalanced, only then does nimbus assign the
> topology to a working supervisor.
> Another symptom is that, when the machines running Storm are started, only
> one topology comes up. The second topology is not assigned to a supervisor.
> Again, it takes a rebalance for nimbus to actually assign the topology
> somewhere.
> A couple of possibly related bugs are STORM-256 and STORM-341, but I don't
> understand those bugs well enough to link them to these problems.
> This is a major issue for us. One of the reasons for using Storm is that if a
> supervisor were to die, Storm would automatically fail over to another
> supervisor. This does not happen, leaving our cluster with a single point of
> failure (SPOF).