[ https://issues.apache.org/jira/browse/STORM-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15050074#comment-15050074 ]
ASF GitHub Bot commented on STORM-1383:
---------------------------------------
GitHub user d2r opened a pull request:
https://github.com/apache/storm/pull/938
[STORM-1383] Avoid supervisor crashing if nimbus is unavailable
* Adds a new exception to differentiate the failure to find a nimbus leader
from a network failure.
* Since blobs are scanned for updates periodically, do not crash the
supervisor if nimbus is not available.
* Do not crash the supervisor if nimbus is unavailable when downloading
topology resources for launch.
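
The approach in the bullets above could be sketched roughly as follows. This is only an illustration and not the code in the pull request: the names LeaderNotFoundException, BlobSource, and BlobSyncTask are hypothetical, and the real supervisor code may be structured quite differently. The idea shown is that a dedicated exception type lets the periodic blob scan tell "no nimbus leader" apart from a generic network failure, and that in both cases the scan logs and skips the pass instead of letting the error bring down the supervisor.

```java
import java.io.IOException;

// Hypothetical exception (illustrative name): no nimbus leader could be found.
class LeaderNotFoundException extends RuntimeException {
    LeaderNotFoundException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for whatever client the supervisor uses to reach nimbus.
interface BlobSource {
    void syncBlobVersions() throws IOException;
}

class BlobSyncTask {
    enum Outcome { SYNCED, SKIPPED_NO_LEADER, SKIPPED_NETWORK }

    // Runs one pass of the periodic scan. Nimbus-availability errors are
    // logged and reported as a skipped pass rather than propagated, so the
    // caller's loop simply tries again on the next scheduled scan.
    static Outcome runOnce(BlobSource source) {
        try {
            source.syncBlobVersions();
            return Outcome.SYNCED;
        } catch (LeaderNotFoundException e) {
            System.err.println("No nimbus leader found, skipping this scan: " + e.getMessage());
            return Outcome.SKIPPED_NO_LEADER;
        } catch (IOException e) {
            System.err.println("Network failure talking to nimbus, skipping this scan: " + e.getMessage());
            return Outcome.SKIPPED_NETWORK;
        }
    }
}
```

A scheduler would call BlobSyncTask.runOnce(...) on each tick. Because blobs are rescanned periodically, a skipped pass only delays the update until nimbus is reachable again.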
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/d2r/storm storm-1383-nimbus-supvor-crash-loop
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/storm/pull/938.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #938
----
commit c007aa181d140cbbce505a3e3147b433b758d80f
Author: Derek Dagit <[email protected]>
Date: 2015-12-10T05:02:49Z
Avoid supervisor crashing if nimbus is unavailable
----
> Supervisors should not crash if nimbus is unavailable
> -----------------------------------------------------
>
> Key: STORM-1383
> URL: https://issues.apache.org/jira/browse/STORM-1383
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-core
> Affects Versions: 0.11.0
> Reporter: Derek Dagit
> Assignee: Derek Dagit
>
> In cases of maintenance or unexpected downtime of nimbus nodes, supervisors
> will crash in a loop. This can cause a lot of confusion among users
> (supervisors crash repeatedly) and admins (monitoring/alerting triggered for
> the entire cluster).
> Supervisors periodically check with nimbus to synchronize blob versions, and
> as part of this, a connection is made to the leader nimbus daemon. Formerly,
> supervisors did not periodically contact nimbus, and so nimbus downtime did
> not cascade to cluster-wide supervisor failures.
> It might be nice to handle the case when nimbus cannot be contacted, and
> continue in the normal loop.