[
https://issues.apache.org/jira/browse/STORM-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988772#comment-14988772
]
Zhuo Liu edited comment on STORM-1155 at 11/4/15 2:35 AM:
----------------------------------------------------------
Hi Basti,
Thanks for the information. I think it is reasonable to kill the workers if
we find that the health check scripts on a node fail.
a. There is a scenario in which a supervisor cannot heartbeat while its
workers are still working well and heartbeating. In that case, Nimbus treats
the supervisor as an isolated node and still treats the workers as healthy.
The health check feature will not break this case.
b. However, if the health check scripts fail, it normally indicates that it is
no longer appropriate to run any worker on this node, so we need to kill the
running workers on this node. Health check scripts check things such as
available disk space, network connectivity, security keys, etc. If we just
exit the supervisor while leaving the workers running, those workers may cause
unexpected errors. Our intent for the health check is to ban nodes that have
problems.
was (Author: zhuoliu):
Hi Basti,
Thanks for the information. I think it is reasonable to kill the workers if
we find that the health check scripts on a node fail.
a. There is a scenario in which a supervisor cannot heartbeat while its
workers are still working well and heartbeating. In that case, Nimbus treats
the supervisor as an isolated node and still treats the workers as healthy.
The health check feature will not break this case.
b. However, if the health check scripts fail, it normally indicates that it is
no longer appropriate to run any worker on this node, so we need to kill the
running workers on this node. Health check scripts check things such as
available disk space, security keys, etc. If we just exit the supervisor while
leaving the workers running, those workers may cause unexpected errors. Our
intent for the health check is to ban nodes that have problems.
> Supervisor recurring health checks
> ----------------------------------
>
> Key: STORM-1155
> URL: https://issues.apache.org/jira/browse/STORM-1155
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-core
> Reporter: Thomas Graves
> Assignee: Thomas Graves
>
> Add the ability for the supervisor to call out to health check scripts to
> allow some validation of the health of the node the supervisor is running on.
> It could regularly run scripts in a directory provided by the cluster admin.
> If any scripts fail, it should kill the workers and stop itself.
> This could work very much like the Hadoop scripts: if ERROR is returned on
> stdout, it means the node has some issue and we should shut down.
> If a non-zero exit code is returned, it indicates that the script failed to
> execute properly, so you don't want to mark the node as unhealthy.
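
The rule in the description above can be sketched roughly as follows. This is
only an illustration in Python, not Storm's implementation. The names
(Health, classify, run_script, node_is_healthy), the 5 second timeout, the
"line starts with ERROR" match, and the choice to let a non-zero exit code
take priority over any ERROR output are all assumptions made for the example.

```python
import os
import subprocess
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"          # script reported ERROR on stdout
    SCRIPT_FAILED = "script_failed"  # script itself did not run properly


def classify(exit_code: int, stdout: str) -> Health:
    """Classify one script result (hypothetical helper)."""
    # Assumption: a non-zero exit means the script could not run properly,
    # so its output is not trusted and the node is not marked unhealthy.
    if exit_code != 0:
        return Health.SCRIPT_FAILED
    # Assumption: "ERROR on stdout" means some line begins with ERROR.
    if any(line.startswith("ERROR") for line in stdout.splitlines()):
        return Health.UNHEALTHY
    return Health.HEALTHY


def run_script(path: str, timeout_s: float = 5.0) -> Health:
    """Run one health check script and classify its result."""
    try:
        proc = subprocess.run(
            [path], capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired):
        # Assumption: a script that cannot start or times out counts as a
        # script failure rather than as an unhealthy node.
        return Health.SCRIPT_FAILED
    return classify(proc.returncode, proc.stdout)


def node_is_healthy(script_dir: str) -> bool:
    """Return False as soon as any executable script reports UNHEALTHY."""
    for name in sorted(os.listdir(script_dir)):
        path = os.path.join(script_dir, name)
        if os.path.isfile(path) and os.access(path, os.X_OK):
            if run_script(path) is Health.UNHEALTHY:
                return False
    return True
```

Under this sketch, a supervisor would call node_is_healthy periodically and,
when it returns False, kill its workers before stopping itself, which matches
the behavior discussed in the comment above.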