[jira] [Comment Edited] (STORM-1155) Supervisor recurring health checks

Zhuo Liu (JIRA) Tue, 03 Nov 2015 18:36:58 -0800

    [ https://issues.apache.org/jira/browse/STORM-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988772#comment-14988772 ]


Zhuo Liu edited comment on STORM-1155 at 11/4/15 2:35 AM:
----------------------------------------------------------

Hi Basti,
Thanks for the information. I think it is reasonable to kill the workers on a 
node if the health check scripts on that node fail.

a. There is a scenario where a supervisor cannot heartbeat while its workers 
are still working well and heartbeating. In that case, Nimbus treats the 
supervisor as an isolated node and still treats the workers as healthy. The 
health check feature will not break this case.

b. However, if the health check scripts fail, it normally indicates that the 
node is no longer fit to run any worker, so we need to kill the workers 
running on it. Health check scripts check things such as available disk space, 
network connectivity, security keys, etc. If we just exit the supervisor while 
leaving the workers running, those workers may cause unexpected errors. Our 
intent for the health check is to ban nodes that have problems.
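
A minimal sketch of the ordering argued for in (b): when the node is found 
unhealthy, the workers are killed first and only then does the supervisor 
stop, so no worker is left running on a banned node. This is an illustration 
only, not Storm's actual supervisor code; `react_to_health_check`, 
`kill_worker`, and `stop_supervisor` are hypothetical names.

```python
def react_to_health_check(node_unhealthy, running_workers, kill_worker, stop_supervisor):
    """Kill every running worker, then stop the supervisor, if the node is unhealthy.

    kill_worker and stop_supervisor are hypothetical callbacks supplied by the
    caller; they stand in for whatever the real supervisor does.
    Returns the list of worker ids that were killed.
    """
    if not node_unhealthy:
        # Healthy node: leave the workers and the supervisor alone.
        return []

    killed = []
    # Copy the collection so the callback may safely mutate the original.
    for worker_id in list(running_workers):
        kill_worker(worker_id)
        killed.append(worker_id)

    # Stop the supervisor only after all workers are gone.
    stop_supervisor()
    return killed
```

In case (a), where the supervisor merely cannot heartbeat, this path is never 
taken, so healthy workers keep running.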



> Supervisor recurring health checks
> ----------------------------------
>
>                 Key: STORM-1155
>                 URL: https://issues.apache.org/jira/browse/STORM-1155
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>
> Add the ability for the supervisor to call out to health check scripts to 
> allow some validation of the health of the node the supervisor is running on.
> It could regularly run scripts in a directory provided by the cluster admin. 
> If any scripts fail, it should kill the workers and stop itself.
> This could work very much like the Hadoop scripts and if ERROR is returned on 
> stdout it means the node has some issue and we should shut down.
> If a non-zero exit code is returned it indicates that the scripts failed to 
> execute properly so you don't want to mark the node as unhealthy.
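
The rules in the issue description can be sketched roughly as follows. This is 
a hedged illustration, not the implementation that was committed to Storm. 
Assumptions not stated in the issue: the exit code is checked before stdout, 
"ERROR" must appear at the start of a stdout line (the Hadoop convention), a 
timed-out or unlaunchable script counts as a script failure rather than an 
unhealthy node, and the names `classify`, `run_health_checks`, and 
`node_is_unhealthy` are hypothetical.

```python
import subprocess
from pathlib import Path

HEALTHY = "healthy"
UNHEALTHY = "unhealthy"
SCRIPT_FAILED = "script_failed"  # script did not run properly; node is not marked unhealthy


def classify(exit_code, stdout):
    """Map one script's exit code and stdout to a health status."""
    # Assumption: a non-zero exit code takes precedence over stdout content.
    if exit_code != 0:
        return SCRIPT_FAILED
    # Assumption: ERROR must start a line, as in Hadoop's node health scripts.
    if any(line.startswith("ERROR") for line in stdout.splitlines()):
        return UNHEALTHY
    return HEALTHY


def run_health_checks(script_dir, timeout_s=30):
    """Run every regular file in script_dir and return {script name: status}."""
    results = {}
    for script in sorted(Path(script_dir).iterdir()):
        if not script.is_file():
            continue
        try:
            proc = subprocess.run(
                [str(script)], capture_output=True, text=True, timeout=timeout_s
            )
            results[script.name] = classify(proc.returncode, proc.stdout)
        except (OSError, subprocess.TimeoutExpired):
            # Assumption: launch failures and timeouts are script failures.
            results[script.name] = SCRIPT_FAILED
    return results


def node_is_unhealthy(results):
    """The node is unhealthy if any script reported ERROR."""
    return UNHEALTHY in results.values()
```

A supervisor could call `run_health_checks` on a timer against the 
admin-provided directory and shut down its workers and itself whenever 
`node_is_unhealthy` returns true.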



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
