Robert Joseph Evans created STORM-909:
-----------------------------------------

             Summary: Automatic Black Listing of bad nodes
                 Key: STORM-909
                 URL: https://issues.apache.org/jira/browse/STORM-909
             Project: Apache Storm
          Issue Type: Improvement
            Reporter: Robert Joseph Evans


We should be able to detect and monitor the failure rate of workers on nodes, 
and come up with a few different probabilities:  How likely is it that this 
worker will fail on this particular node in the next n minutes?  How likely is 
it that all workers will fail on this particular node in the next n minutes?  
How likely is it that this worker will fail on any node in the next n minutes?
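
As a rough illustration of how such probabilities might be estimated (a sketch 
only, not anything that exists in Storm): assume worker failures arrive as a 
Poisson process with a constant rate, so the chance of at least one failure in 
the next n minutes is 1 - exp(-rate * n).  The Failure record and both 
function names below are hypothetical.

```python
import math
from collections import namedtuple

# Hypothetical record of one observed worker failure; not a Storm structure.
Failure = namedtuple("Failure", ["topology", "node"])


def count_failures(failures, topology=None, node=None):
    """Count failures matching the given topology and/or node (None = any)."""
    return sum(
        1
        for f in failures
        if (topology is None or f.topology == topology)
        and (node is None or f.node == node)
    )


def failure_probability(failure_count, exposure_minutes, horizon_minutes):
    """Estimate P(at least one failure in the next horizon_minutes).

    Assumes a constant-rate Poisson model, which is a simplification:
    failure_count failures were seen over exposure_minutes of running time.
    """
    if exposure_minutes <= 0:
        raise ValueError("exposure_minutes must be positive")
    rate = failure_count / exposure_minutes  # failures per minute
    return 1.0 - math.exp(-rate * horizon_minutes)
```

Filtering by topology and node gives the first question, by node alone a 
node-wide estimate, and by topology alone the any-node estimate.  A real 
implementation would also need to decide how much history to keep and how to 
treat nodes with very little running time.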

With these we should be able to detect bad nodes and blacklist them, and 
ideally trigger external systems that can take action to try to fix the 
nodes.  We should also be able to detect topologies that have bugs; in the 
common case we would warn about them, and in the worst case stop trying to 
run them.
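
One way the decisions described above could be expressed, purely as a sketch: 
compare each estimated probability against a threshold.  The function names 
and the threshold values here are assumptions chosen for illustration; the 
issue does not propose specific numbers.

```python
def node_action(p_node_failure, blacklist_at=0.9):
    """Return 'blacklist' when a node's failure probability is high, else 'ok'.

    blacklist_at is an illustrative threshold, not a proposed default.
    """
    return "blacklist" if p_node_failure >= blacklist_at else "ok"


def topology_action(p_failure_any_node, warn_at=0.5, stop_at=0.95):
    """Return 'stop', 'warn', or 'ok' for a topology.

    A topology that is likely to fail on any node probably has a bug of its
    own, so it is warned about first and stopped only in the worst case.
    """
    if p_failure_any_node >= stop_at:
        return "stop"
    if p_failure_any_node >= warn_at:
        return "warn"
    return "ok"
```

Keeping the node check and the topology check separate reflects the 
distinction drawn above: failures concentrated on one node point at the node, 
while failures that follow a topology everywhere point at the topology.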



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
