[ 
https://issues.apache.org/jira/browse/STORM-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988787#comment-14988787
 ] 

Longda Feng commented on STORM-1155:
------------------------------------

@all

I think this would be very useful for improving the stability of Storm.

Over the last four years I have run into many kinds of errors, and any one of 
them can slow down the whole cluster. So we definitely need a way to check 
the current node's status.

The only open question is the design: we should define one common interface 
that makes it very easy to plug in a script or an executable binary.
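
One possible shape for such an interface is sketched below in Python. It 
follows the rules in the issue description quoted at the end of this message: 
ERROR on stdout marks the node unhealthy, and a non-zero exit code means the 
check itself failed. Every name here (Health, classify, run_check, 
run_checks) is hypothetical and not part of Storm. Matching ERROR at the 
start of a line and giving the exit code precedence are assumptions.

```python
import os
import subprocess
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"        # check reported ERROR on stdout
    CHECK_FAILED = "check_failed"  # check could not run; node is NOT marked unhealthy


def classify(exit_code, stdout):
    """Map one check's exit code and stdout to a Health value."""
    # Assumption: a non-zero exit code takes precedence over any output.
    if exit_code != 0:
        return Health.CHECK_FAILED
    # Assumption: ERROR must appear at the start of a stdout line.
    if any(line.startswith("ERROR") for line in stdout.splitlines()):
        return Health.UNHEALTHY
    return Health.HEALTHY


def run_check(path, timeout_s=30):
    """Run a single script or binary and classify its result."""
    try:
        proc = subprocess.run(
            [path], capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing file, permission problem, or a hung check.
        return Health.CHECK_FAILED
    return classify(proc.returncode, proc.stdout)


def run_checks(directory, timeout_s=30):
    """Run every executable file in `directory`, in name order."""
    results = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.access(path, os.X_OK):
            results[name] = run_check(path, timeout_s)
    return results
```

With this shape the caller would only need to look for Health.UNHEALTHY in 
the results. Because a check is just an executable file, a shell script and a 
compiled binary plug in the same way.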

The problems I have run into:
(1) the disk is full
(2) network errors: one node can't ping another node, or the latency is very 
high
(3) the system is out of memory
(4) some binary has been removed, such as the java or storm binary
(5) kernel errors; sometimes the cause may be a memory hardware error or a 
network adapter error
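
Two of these cases, (1) and (4), could be covered by a small check script 
like the Python sketch below. It is only an illustration: the function names, 
the 5% free-space threshold, and the "ERROR ..." message format are 
assumptions, not anything defined by Storm.

```python
import shutil


def disk_problem(path, total_bytes, free_bytes, min_free_fraction=0.05):
    """Return an ERROR line if free space is below the threshold, else None."""
    if total_bytes <= 0:
        return None  # nothing sensible to report for an empty filesystem
    free_fraction = free_bytes / total_bytes
    if free_fraction < min_free_fraction:
        return f"ERROR disk {path} is nearly full ({free_fraction:.1%} free)"
    return None


def check_disk(path="/"):
    """Case (1): probe real disk usage for `path`."""
    usage = shutil.disk_usage(path)
    return disk_problem(path, usage.total, usage.free)


def check_binaries(names=("java", "storm")):
    """Case (4): report binaries that can no longer be found on PATH."""
    missing = [name for name in names if shutil.which(name) is None]
    if missing:
        return "ERROR missing binaries: " + ", ".join(missing)
    return None


def main():
    # Print one ERROR line per problem; the script still exits with status 0,
    # because a non-zero status would mean the check itself failed.
    for problem in (check_disk(), check_binaries()):
        if problem:
            print(problem)


if __name__ == "__main__":
    main()
```

The network, memory, and kernel cases would need their own probes, which is 
one more reason to keep each check a separate pluggable script.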





> Supervisor recurring health checks
> ----------------------------------
>
>                 Key: STORM-1155
>                 URL: https://issues.apache.org/jira/browse/STORM-1155
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>
> Add the ability for the supervisor to call out to health check scripts to 
> allow some validation of the health of the node the supervisor is running on.
> It could regularly run scripts in a directory provided by the cluster admin. 
> If any scripts fail, it should kill the workers and stop itself.
> This could work very much like the Hadoop scripts and if ERROR is returned on 
> stdout it means the node has some issue and we should shut down.
> If a non-zero exit code is returned it indicates that the scripts failed to 
> execute properly so you don't want to mark the node as unhealthy.


