[ 
https://issues.apache.org/jira/browse/STORM-397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089721#comment-14089721
 ] 

Robert Joseph Evans commented on STORM-397:
-------------------------------------------

This is really odd; I have not been able to reproduce it.  My only guess is 
that when you pull the network cable Async gets an error and dies quickly, 
whereas Sync does not and keeps trying to reconnect to ZooKeeper.  When you 
plug the cable back in, it finally succeeds in reconnecting and Nimbus 
continues to see it as up and running.  Do you have your supervisors running 
under supervision so that they are restarted when they exit?

The 90MB of logs is almost certainly STORM-341.  I'll try to take the proposed 
fix for STORM-341 and come up with a patch you can try, to see if that at least 
clears up the logs for you.  Could you please look at the logs for your workers 
on sc-alpha-r and try to determine whether the workers crashed because of the 
network cut (and if so, when), or whether they continued running?
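
For lining up timestamps, something like the following might help.  It is only 
a rough sketch and not part of Storm: it assumes each nimbus.log entry sits on 
a single line in the format shown in the issue description quoted below, and 
the function names (parse_assignments, assignment_intervals) are made up for 
illustration.  It pulls out the "Setting new assignment" entries and reports 
the gap in seconds between consecutive entries for each topology.

```python
import re
import sys
from collections import defaultdict
from datetime import datetime

# Assumed entry format (one entry per line), e.g.:
# 2014-07-09 14:17:11 -: b.s.d.nimbus [INFO] Setting new assignment for
#   topology id Sync-1-1404911509: #backtype.storm.daemon.common.Assignment{...}
ASSIGNMENT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*?"
    r"Setting new assignment for topology id (\S+?):"
)
# Captures the contents of the :node->host {...} map.
NODE_HOST_RE = re.compile(r":node->host \{([^}]*)\}")
# Each map entry is a quoted supervisor id followed by a quoted host name.
HOST_PAIR_RE = re.compile(r'"[^"]*"\s+"([^"]*)"')


def parse_assignments(lines):
    """Yield (timestamp, topology_id, hosts) for each assignment entry."""
    for line in lines:
        match = ASSIGNMENT_RE.search(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        node_host = NODE_HOST_RE.search(line)
        hosts = HOST_PAIR_RE.findall(node_host.group(1)) if node_host else []
        yield stamp, match.group(2), hosts


def assignment_intervals(lines):
    """Map topology id -> seconds between consecutive assignment entries."""
    stamps_by_topology = defaultdict(list)
    for stamp, topology, _hosts in parse_assignments(lines):
        stamps_by_topology[topology].append(stamp)
    return {
        topology: [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
        for topology, stamps in stamps_by_topology.items()
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python assignments.py nimbus.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for topology, gaps in assignment_intervals(log).items():
            print(topology, gaps)
```

A topology that shows a steady stream of short gaps to the same host would 
match the repeated reassignment described in the report; comparing the 
timestamps with the time the cable was pulled should show whether the pattern 
changes at that point.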

> Nimbus does not reassign a topology when the supervisor dies
> ------------------------------------------------------------
>
>                 Key: STORM-397
>                 URL: https://issues.apache.org/jira/browse/STORM-397
>             Project: Apache Storm (Incubating)
>          Issue Type: Bug
>    Affects Versions: 0.9.2-incubating
>         Environment: 2 topologies, 3 supervisors
>            Reporter: Simon Cooper
>            Priority: Critical
>         Attachments: nimbus.log, storm.png
>
>
> We're running two topologies on a cluster with 3 supervisors. By default, 
> both topologies are assigned onto the same supervisor. If that supervisor 
> dies, storm reassigns one topology to another supervisor but not the other, 
> leaving the second topology inactive.
> There are various symptoms/possible causes of this problem. In the nimbus 
> logs, from when the topologies are initially submitted, nimbus is continually 
> trying to reassign the second topology to the same supervisor every 10 
> seconds:
> {noformat}
> 2014-07-09 14:17:11 -: b.s.d.nimbus [INFO] Setting new assignment for 
> topology id Sync-1-1404911509: 
> #backtype.storm.daemon.common.Assignment{:master-code-dir 
> "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host 
> {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port 
> {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs 
> {[1 2] 1404911831, [7 8] 1404911831, [3 4] 1404911831, [12 12] 1404911831, [9 
> 10] 1404911831, [5 5] 1404911831, [11 11] 1404911831, [6 6] 1404911831}}
> 2014-07-09 14:17:21 -: b.s.d.nimbus [INFO] Setting new assignment for 
> topology id Sync-1-1404911509: 
> #backtype.storm.daemon.common.Assignment{:master-code-dir 
> "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host 
> {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port 
> {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs 
> {[1 2] 1404911841, [7 8] 1404911841, [3 4] 1404911841, [12 12] 1404911841, [9 
> 10] 1404911841, [5 5] 1404911841, [11 11] 1404911841, [6 6] 1404911841}}
> 2014-07-09 14:17:32 -: b.s.d.nimbus [INFO] Setting new assignment for 
> topology id Sync-1-1404911509: 
> #backtype.storm.daemon.common.Assignment{:master-code-dir 
> "/storm/nimbus/stormdist/Sync-1-1404911509", :node->host 
> {"9f5f2ddd-40ee-4ac1-b705-2957089af330" "sc-beta-r"}, :executor->node+port 
> {[6 6] ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [11 11] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [5 5] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [9 10] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [12 12] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [3 4] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [7 8] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703], [1 2] 
> ["9f5f2ddd-40ee-4ac1-b705-2957089af330" 6703]}, :executor->start-time-secs 
> {[1 2] 1404911852, [7 8] 1404911852, [3 4] 1404911852, [12 12] 1404911852, [9 
> 10] 1404911852, [5 5] 1404911852, [11 11] 1404911852, [6 6] 1404911852}}
> {noformat}
> These log messages continue after the supervisor it's running on dies - 
> nimbus continually tries to reassign to a dead supervisor. Note that the 
> other topology is reassigned elsewhere without problems.
> If the broken topology is rebalanced, only then does nimbus assign the 
> topology to a working supervisor.
> Another symptom of this is that, when the machines running storm are started, 
> only one topology is running on startup. The second topology is not assigned 
> to a supervisor. Again, it takes a rebalance for nimbus to actually assign 
> the topology somewhere.
> A couple of possibly related bugs are STORM-256 and STORM-341, but I don't 
> understand those bugs well enough to link them to these problems.
> This is a major issue for us. One of the reasons for using storm is that if a 
> supervisor were to die, storm would automatically fail over to another 
> supervisor. This does not happen, leaving our cluster with a SPOF.



