[
https://issues.apache.org/jira/browse/STORM-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992035#comment-14992035
]
ASF GitHub Bot commented on STORM-1155:
---------------------------------------
Github user revans2 commented on the pull request:
https://github.com/apache/storm/pull/849#issuecomment-154127266
@longdafeng it might be nice in the future to have some "standard" health
check scripts ship with Storm, but honestly, even on Linux alone, it is
difficult to come up with something that will work on all Linux distros, or
even on all boxes running the same distro.
One potential issue I would want to check for is the network card falling back
to 100Mbit. But that check does not work if you are running on old 100Mbit
hardware, or if for some reason you are using InfiniBand with a 100Mbit control
path. And what about bonded NICs? In my opinion there are too many variables to
make most of these checks generic.
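
To make the point concrete, here is a minimal sketch of a link-speed check, assuming a Linux box where `/sys/class/net/<interface>/speed` reports the speed in Mbit/s. The function names and the `expected_mbit` parameter are hypothetical and not part of Storm. The expected speed has to be supplied per site, which is why a check like this is hard to ship as a generic default.

```python
def read_link_speed(interface):
    """Read the link speed in Mbit/s from Linux sysfs (assumed location).

    The read can fail with OSError on interfaces that are down or that
    do not report a speed, so callers should be prepared for that.
    """
    with open(f"/sys/class/net/{interface}/speed") as f:
        return int(f.read().strip())


def check_link_speed(reported_mbit, expected_mbit):
    """Return the line a health check script would print for one interface."""
    if reported_mbit < 0:
        # Some drivers report -1 when the speed is unknown; do not flag the node.
        return "WARN link speed unknown"
    if reported_mbit < expected_mbit:
        return (f"ERROR link speed {reported_mbit}Mbit is below "
                f"expected {expected_mbit}Mbit")
    return "OK"
```

On old 100Mbit hardware the same script only makes sense with `expected_mbit=100`, and bonded NICs would need yet another rule.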
The default directory is already `$STORM_HOME/healthcheck`. Directories
are usually relative to `$STORM_HOME`, but there have been a few cases where we
needed to make it explicit. @zhuoliu you did a lot of work around resolving
relative paths in configs. Could you take a look at this and see if we need to
change anything so that it is always relative to `$STORM_HOME`?
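
A rough sketch of the resolution behavior being asked for, not Storm's actual code: a relative setting is joined onto the Storm home directory, and an absolute setting is left alone. The function name `resolve_health_check_dir` and its parameters are hypothetical.

```python
import os


def resolve_health_check_dir(configured, storm_home):
    """Return an absolute directory for the health check scripts.

    Relative values are interpreted relative to storm_home; absolute
    values are returned unchanged (after normalization).
    """
    if os.path.isabs(configured):
        return os.path.normpath(configured)
    return os.path.normpath(os.path.join(storm_home, configured))
```

With this rule, a configured value of `healthcheck` and a Storm home of `/opt/storm` (an example path) resolve to `/opt/storm/healthcheck` regardless of the supervisor's working directory.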
> Supervisor recurring health checks
> ----------------------------------
>
> Key: STORM-1155
> URL: https://issues.apache.org/jira/browse/STORM-1155
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-core
> Reporter: Thomas Graves
> Assignee: Thomas Graves
>
> Add the ability for the supervisor to call out to health check scripts to
> allow some validation of the health of the node the supervisor is running on.
> It could regularly run scripts in a directory provided by the cluster admin.
> If any scripts fail, it should kill the workers and stop itself.
> This could work very much like the Hadoop scripts: if ERROR is returned on
> stdout, it means the node has some issue and we should shut down.
> If a non-zero exit code is returned, it indicates that the script failed to
> execute properly, so you don't want to mark the node as unhealthy.
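
The rules in the description above can be sketched as follows. This is an illustration only, not the supervisor's implementation. It assumes that "ERROR on stdout" means an output line beginning with `ERROR`, and that a timeout or launch failure counts as a script failure rather than an unhealthy node. All names and the `timeout_s` parameter are hypothetical.

```python
import os
import subprocess


def classify(exit_code, stdout):
    """Classify one script run according to the issue description."""
    if exit_code != 0:
        # The script itself failed to run properly: not evidence of a bad node.
        return "script_failed"
    if any(line.startswith("ERROR") for line in stdout.splitlines()):
        return "unhealthy"
    return "healthy"


def run_health_checks(directory, timeout_s=5.0):
    """Run every executable file in directory and classify each result."""
    results = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            continue  # skip subdirectories and non-executable files
        try:
            proc = subprocess.run([path], capture_output=True,
                                  text=True, timeout=timeout_s)
            results[name] = classify(proc.returncode, proc.stdout)
        except (subprocess.TimeoutExpired, OSError):
            # Assumption: a hung or unlaunchable script is a script failure.
            results[name] = "script_failed"
    return results


def node_is_unhealthy(results):
    """True if any script reported ERROR; the caller would then shut down."""
    return any(status == "unhealthy" for status in results.values())
```

A supervisor would call `run_health_checks` on a timer and, when `node_is_unhealthy` returns true, kill its workers and stop itself.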