[ 
https://issues.apache.org/jira/browse/STORM-1383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15058309#comment-15058309
 ] 

ASF GitHub Bot commented on STORM-1383:
---------------------------------------

Github user d2r commented on the pull request:

    https://github.com/apache/storm/pull/938#issuecomment-164825084
  
    * JDK7 storm-core: drpc-auth-test lost a race while binding an ephemeral port
        ```
        118390 [Thread-1037] ERROR b.s.s.a.ThriftServer - ThriftServer is being stopped due to: org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:52747.
        ```
    * JDK8 storm-core: nimbus-auth-test failed, likely will be fixed by #941 
    
    * JDK7 !storm-core: org.apache.storm.cassandra.DynamicStatementBuilderTest:
        ```
        java.lang.AssertionError: Cassandra daemon did not start within timeout
        ```
      Created [STORM-1392](https://issues.apache.org/jira/browse/STORM-1392)
    
    None of these is related to this PR.  I will close and re-open this PR to start a new test run.



> Supervisors should not crash if nimbus is unavailable
> -----------------------------------------------------
>
>                 Key: STORM-1383
>                 URL: https://issues.apache.org/jira/browse/STORM-1383
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>    Affects Versions: 0.11.0
>            Reporter: Derek Dagit
>            Assignee: Derek Dagit
>
> In cases of maintenance or unexpected downtime of nimbus nodes, supervisors 
> will crash in a loop.  This can cause a lot of confusion among users 
> (supervisors crash repeatedly) and admins (monitoring/alerting triggered for 
> the entire cluster).
> Supervisors periodically check with nimbus to synchronize blob versions, and 
> as part of this, a connection is made to the leader nimbus daemon.  Formerly, 
> supervisors did not periodically contact nimbus, and so nimbus downtime did 
> not cascade to cluster-wide supervisor failures.
> It might be nice to handle the case when nimbus cannot be contacted, and 
> continue in the normal loop.
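
The behavior requested in the description above (tolerate an unreachable nimbus and continue the normal loop) could be sketched roughly as follows. This is a minimal illustration only, not Storm's actual supervisor code: `BlobSyncSketch`, `NimbusClient`, `syncBlobVersions`, `syncOnce`, and `runCycles` are hypothetical names, and the use of `IOException` to signal a failed connection is an assumption.

```java
import java.io.IOException;

/** Hypothetical sketch; these names do not come from the Storm code base. */
class BlobSyncSketch {

    /** Stand-in for whatever call contacts the leader nimbus daemon. */
    interface NimbusClient {
        void syncBlobVersions() throws IOException;
    }

    /**
     * Runs one iteration of the periodic sync. Returns true if the sync
     * succeeded and false if nimbus could not be contacted. The connection
     * failure is logged and swallowed so the caller's loop keeps running.
     */
    static boolean syncOnce(NimbusClient client) {
        try {
            client.syncBlobVersions();
            return true;
        } catch (IOException e) {
            // Assumption: an unreachable nimbus surfaces as an IOException.
            System.err.println("Could not contact nimbus, retrying next cycle: "
                    + e.getMessage());
            return false;
        }
    }

    /** Runs a fixed number of sync cycles and returns how many of them failed. */
    static int runCycles(NimbusClient client, int cycles) {
        int failures = 0;
        for (int i = 0; i < cycles; i++) {
            if (!syncOnce(client)) {
                failures++;
            }
        }
        return failures;
    }
}
```

With this shape, a client that fails during nimbus downtime only increments the failure count; later cycles still run and succeed once nimbus is reachable again.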



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
