[ 
https://issues.apache.org/jira/browse/STORM-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15050074#comment-15050074
 ] 

ASF GitHub Bot commented on STORM-1383:
---------------------------------------

GitHub user d2r opened a pull request:

    https://github.com/apache/storm/pull/938

    [STORM-1383] Avoid supervisor crashing if nimbus is unavailable

    * Adds a new exception to differentiate the failure to find a nimbus leader
      from a network failure.
    * Since blobs are scanned for updates periodically, do not crash the
      supervisor if nimbus is not available.
    * Do not crash the supervisor if nimbus is unavailable when downloading
      topology resources for launch.
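
A rough sketch of the idea in the list above, not the code in this pull request: a dedicated exception type lets the periodic blob sync tell "no nimbus leader found" apart from other failures, so that only the former is logged and skipped. All names here (NimbusLeaderNotFoundSketchException, BlobSyncSketch, syncOnce, contactNimbus) are hypothetical and chosen for illustration; the actual patch may be structured differently.

```java
import java.util.concurrent.Callable;

// Hypothetical exception: signals that no nimbus leader could be found,
// as opposed to some other failure such as a generic network error.
class NimbusLeaderNotFoundSketchException extends RuntimeException {
    NimbusLeaderNotFoundSketchException(String message) {
        super(message);
    }
}

class BlobSyncSketch {
    enum Outcome { SYNCED, SKIPPED_NO_LEADER }

    // Runs one pass of the periodic sync. contactNimbus stands in for
    // whatever call reaches the leader nimbus (an assumption of this sketch).
    static Outcome syncOnce(Callable<Void> contactNimbus) throws Exception {
        try {
            contactNimbus.call();
            return Outcome.SYNCED;
        } catch (NimbusLeaderNotFoundSketchException e) {
            // Nimbus is down or has no leader: log and let the next
            // scheduled pass retry instead of crashing the supervisor.
            System.err.println("Nimbus leader unavailable, skipping this pass: "
                    + e.getMessage());
            return Outcome.SKIPPED_NO_LEADER;
        }
        // Any other exception propagates unchanged to the caller.
    }
}
```

Catching only the dedicated type keeps unrelated failures visible: anything other than the leader-not-found case still reaches the caller, which can decide whether it is fatal.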


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/d2r/storm storm-1383-nimbus-supvor-crash-loop

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/storm/pull/938.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #938
    
----
commit c007aa181d140cbbce505a3e3147b433b758d80f
Author: Derek Dagit <[email protected]>
Date:   2015-12-10T05:02:49Z

    Avoid supervisor crashing if nimbus is unavailable

----


> Supervisors should not crash if nimbus is unavailable
> -----------------------------------------------------
>
>                 Key: STORM-1383
>                 URL: https://issues.apache.org/jira/browse/STORM-1383
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>    Affects Versions: 0.11.0
>            Reporter: Derek Dagit
>            Assignee: Derek Dagit
>
> In cases of maintenance or unexpected downtime of nimbus nodes, supervisors 
> will crash in a loop.  This can cause a lot of confusion among users 
> (supervisors crash repeatedly) and admins (monitoring/alerting triggered for 
> the entire cluster).
> Supervisors periodically check with nimbus to synchronize blob versions, and 
> as part of this, a connection is made to the leader nimbus daemon.  Formerly, 
> supervisors did not periodically contact nimbus, and so nimbus downtime did 
> not cascade to cluster-wide supervisor failures.
> It might be nice to handle the case when nimbus cannot be contacted, and 
> continue in the normal loop.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
