[ https://issues.apache.org/jira/browse/STORM-397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Simon Cooper updated STORM-397:
-------------------------------
Attachment: nimbus.log, storm.png
This is completely reproducible. Every time the cluster starts up, we need to
manually rebalance to get nimbus to run the topologies somewhere, and every
time we kill the machine running the topologies, it stops working.
Here's the nimbus logfile. sc-alpha-r (5abe7707-763e-461a-9327-ae2ad2979e4d)
is running two topologies, called 'Sync' and 'Async'; sc-beta-r is running
nimbus and another spare supervisor (b252b5bd-c0c3-442b-b464-9936f628fd01); and
a third supervisor is on sc-gamma-r (9466061c-dd13-4de6-927c-ac6a5c60adeb). The
zookeeper cluster is running across those three machines as well.
I pull the network cable out of sc-alpha-r between 17:15:40 and 17:16:00.
Notice that, beforehand, nimbus is still trying to reassign Sync to sc-alpha-r,
even though it's running there already (that log file was originally 90MB,
filled with those messages).
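To put a number on those repeated messages, here is a rough Python sketch that
counts the "Setting new assignment" entries per topology and the gap in seconds
between them. It assumes each entry begins with a line in the format quoted in
the issue description; the names reassignment_times, summarize and SAMPLE are
made up for illustration, and the sample entries are abbreviated to their
headers.

```python
import re
from collections import defaultdict
from datetime import datetime

# Header of a nimbus "Setting new assignment" entry, as quoted in the report.
# \s+ before "topology" tolerates a line wrap at that point.
ASSIGN_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \S+ b\.s\.d\.nimbus \[INFO\] "
    r"Setting new assignment for\s+topology id ([^\s:]+):"
)


def reassignment_times(log_text):
    """Return {topology_id: [datetime, ...]}, one datetime per logged entry."""
    times = defaultdict(list)
    for stamp, topology in ASSIGN_RE.findall(log_text):
        times[topology].append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"))
    return dict(times)


def summarize(times):
    """Return {topology_id: (entry_count, [gap_in_seconds, ...])}."""
    summary = {}
    for topology, stamps in times.items():
        gaps = [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]
        summary[topology] = (len(stamps), gaps)
    return summary


# Abbreviated sample: only the entry headers from the report, bodies omitted.
SAMPLE = """\
2014-07-09 14:17:11 -: b.s.d.nimbus [INFO] Setting new assignment for topology id Sync-1-1404911509: (body omitted)
2014-07-09 14:17:21 -: b.s.d.nimbus [INFO] Setting new assignment for topology id Sync-1-1404911509: (body omitted)
2014-07-09 14:17:32 -: b.s.d.nimbus [INFO] Setting new assignment for topology id Sync-1-1404911509: (body omitted)
"""

if __name__ == "__main__":
    for topology, (count, gaps) in summarize(reassignment_times(SAMPLE)).items():
        print(topology, count, gaps)
```

Running the same two functions over the full nimbus.log should show whether only
Sync is affected and whether the interval stays at roughly 10 seconds.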
After 30s, nimbus notices that Async is not running, and reassigns it to
sc-beta-r. It still tries to reassign Sync to sc-alpha-r.
Interestingly, in the UI, it now shows the two supervisors still running, but
only one slot is being used (two should be used, one for each topology). I've
attached the screenshot.
At 17:37:04, I manually rebalance Sync (notice that nimbus thinks that Sync is
still up), and it reassigns it to sc-beta-r. But then it continues to reassign
to sc-beta-r, and it will continue to do so forever.
My guess is that nimbus is getting stuck in a loop, continually trying to
reassign the Sync topology, and is so busy doing so that it doesn't notice
the worker running Sync has disappeared.
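One way to look at that loop is to compare consecutive assignments. In the
entries quoted in the issue description, the executor placement is identical
from one entry to the next and only the :executor->start-time-secs values move
forward. The Python sketch below checks that mechanically. It is only an
illustration: the regular expressions are an assumption about the printed
Assignment format, the helper names are hypothetical, and the two sample
entries are cut down to two executors each. It does not establish why nimbus
keeps writing the assignment.

```python
import re

# "[6 6] ["<supervisor-id>" 6703]" pairs inside :executor->node+port
PLACEMENT_RE = re.compile(r'\[(\d+)\s+(\d+)\]\s+\["([0-9a-f-]+)"\s+(\d+)\]')
# "[6 6] 1404911831" pairs inside :executor->start-time-secs
START_RE = re.compile(r'\[(\d+)\s+(\d+)\]\s+(\d+)')


def parse_assignment(entry):
    """Split one logged Assignment into (placement, start_times) dicts."""
    head, _, tail = entry.partition(":executor->start-time-secs")
    placement = {
        (int(a), int(b)): (node, int(port))
        for a, b, node, port in PLACEMENT_RE.findall(head)
    }
    start_times = {(int(a), int(b)): int(t) for a, b, t in START_RE.findall(tail)}
    return placement, start_times


def diff_assignments(old_entry, new_entry):
    """Report which parts differ between two logged assignments."""
    old_place, old_start = parse_assignment(old_entry)
    new_place, new_start = parse_assignment(new_entry)
    return {
        "placement_changed": old_place != new_place,
        "start_times_changed": old_start != new_start,
    }


NODE = "9f5f2ddd-40ee-4ac1-b705-2957089af330"

# Two consecutive entries from the report, reduced to executors [6 6] and [1 2].
ENTRY_A = (
    ':node->host {"%s" "sc-beta-r"}, :executor->node+port '
    '{[6 6] ["%s" 6703], [1 2] ["%s" 6703]}, '
    ':executor->start-time-secs {[1 2] 1404911831, [6 6] 1404911831}}'
) % (NODE, NODE, NODE)
ENTRY_B = (
    ':node->host {"%s" "sc-beta-r"}, :executor->node+port '
    '{[1 2] ["%s" 6703], [6 6] ["%s" 6703]}, '
    ':executor->start-time-secs {[1 2] 1404911841, [6 6] 1404911841}}'
) % (NODE, NODE, NODE)

if __name__ == "__main__":
    print(diff_assignments(ENTRY_A, ENTRY_B))
```

Because the maps are compared as dicts, the different print order of the
executors between entries does not count as a change.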
> Nimbus does not reassign a topology when the supervisor dies
> ------------------------------------------------------------
>
> Key: STORM-397
> URL: https://issues.apache.org/jira/browse/STORM-397
> Project: Apache Storm (Incubating)
> Issue Type: Bug
> Affects Versions: 0.9.2-incubating
> Environment: 2 topologies, 3 supervisors
> Reporter: Simon Cooper
> Priority: Critical
> Attachments: nimbus.log, storm.png
>
>
> We're running two topologies on a cluster with 3 supervisors. By default,
> both topologies are assigned onto the same supervisor. If that supervisor
> dies, storm reassigns one topology to another supervisor but not the other,
> leaving the second topology inactive.
> There are various symptoms/possible causes of this problem. In the nimbus
> logs, from when the topologies are initially submitted, nimbus is continually
> trying to reassign the second topology to the same supervisor every 10
> seconds:
> {noformat}
> 2014-07-09 14:17:11 -: b.s.d.nimbus [INFO] Setting new assignment for
> topology id Sync-1-1404911509:
> #backtype.storm.daemon.common.Assignment{:master-code-dir
> "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host
> {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port
> {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs
> {[1 2] 1404911831, [7 8] 1404911831, [3 4] 1404911831, [12 12] 1404911831, [9
> 10] 1404911831, [5 5] 1404911831, [11 11] 1404911831, [6 6] 1404911831}}
> 2014-07-09 14:17:21 -: b.s.d.nimbus [INFO] Setting new assignment for
> topology id Sync-1-1404911509:
> #backtype.storm.daemon.common.Assignment{:master-code-dir
> "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host
> {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port
> {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs
> {[1 2] 1404911841, [7 8] 1404911841, [3 4] 1404911841, [12 12] 1404911841, [9
> 10] 1404911841, [5 5] 1404911841, [11 11] 1404911841, [6 6] 1404911841}}
> 2014-07-09 14:17:32 -: b.s.d.nimbus [INFO] Setting new assignment for
> topology id Sync-1-1404911509:
> #backtype.storm.daemon.common.Assignment{:master-code-dir
> "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host
> {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port
> {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2]
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs
> {[1 2] 1404911852, [7 8] 1404911852, [3 4] 1404911852, [12 12] 1404911852, [9
> 10] 1404911852, [5 5] 1404911852, [11 11] 1404911852, [6 6] 1404911852}}
> {noformat}
> These log messages continue after the supervisor it's running on dies -
> nimbus continually tries to reassign to a dead supervisor. Note that the
> other topology is reassigned elsewhere without problems.
> If the broken topology is rebalanced, only then does nimbus assign the
> topology to a working supervisor.
> Another symptom of this is that, when the machines running storm are started,
> only one topology is running on startup. The second topology is not assigned
> to a supervisor. Again, it takes a rebalance for nimbus to actually assign
> the topology somewhere.
> A couple of possibly related bugs are STORM-256 and STORM-341, but I don't
> really understand those bugs enough to be able to link them to these problems.
> This is a major issue for us. One of the reasons for using storm is that if a
> supervisor were to die, storm would automatically fail over to another
> supervisor. This does not happen, leaving our cluster with a SPOF.