[jira] [Commented] (STORM-1155) Supervisor recurring health checks

ASF GitHub Bot (JIRA) Thu, 05 Nov 2015 09:20:13 -0800

    [ 
https://issues.apache.org/jira/browse/STORM-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992035#comment-14992035
 ]


ASF GitHub Bot commented on STORM-1155:
---------------------------------------

Github user revans2 commented on the pull request:

    https://github.com/apache/storm/pull/849#issuecomment-154127266
  
    @longdafeng it might be nice in the future to ship some "standard" health 
check scripts with Storm, but honestly, even limiting ourselves to Linux, it is 
difficult to come up with something that works on all Linux distros, or even 
on all boxes running the same distro.
    
    The kind of issue I would want to detect is a network card that fell back 
to 100Mbit.  But that check does not work if you are running on old 100Mbit 
hardware, or if for some reason you are using InfiniBand with a 100Mbit control 
path.  And what if you are using bonded NICs?  There are too many variables, in 
my opinion, to make many of these checks generic.
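
    As a rough illustration of why such a check ends up site-specific, here is 
a minimal Python sketch of a link-speed check.  The interface name and the 
threshold are hypothetical values an admin would have to pick per box, the 
`ERROR`-on-stdout convention is taken from the issue description, and reading 
`/sys/class/net/<iface>/speed` assumes a Linux driver that reports a speed 
there.  It is not a script that ships with Storm.

```python
#!/usr/bin/env python3
"""Site-specific health check sketch: flag a NIC that negotiated a slow link."""
from pathlib import Path

# Hypothetical per-site values; neither is a Storm setting.
INTERFACE = "eth0"
MIN_MBIT = 1000


def link_speed_report(iface, min_mbit, sysfs_root="/sys/class/net"):
    """Return an 'ERROR ...' line if the link is slower than min_mbit, else ''."""
    speed_file = Path(sysfs_root) / iface / "speed"
    try:
        speed = int(speed_file.read_text().strip())
    except (OSError, ValueError):
        # Speed unreadable (interface missing or down, virtual device, ...):
        # this sketch gives no verdict rather than guessing.
        return ""
    # Negative values mean "unknown" on Linux, so they are not reported.
    if 0 <= speed < min_mbit:
        return f"ERROR {iface} link speed is {speed} Mbit/s, expected at least {min_mbit}"
    return ""


def main():
    report = link_speed_report(INTERFACE, MIN_MBIT)
    if report:
        print(report)


if __name__ == "__main__":
    main()
```

    On a box that legitimately runs at 100Mbit, `MIN_MBIT` would have to be 
lowered, and a bonded setup would need a different interface name (or several), 
which is exactly the per-site tuning that makes a generic default hard.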
    
    The default directory is `$STORM_HOME/healthcheck` already.  Directories 
are usually relative to `$STORM_HOME`, but there have been a few cases where we 
needed to make it explicit.  @zhuoliu you did a lot of work around resolving 
relative paths in configs.  Could you take a look at this and see if we need to 
change anything so that it is always relative to `$STORM_HOME/`?
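
    The rule being asked for (relative directories are anchored at 
`$STORM_HOME`, absolute ones are left alone) can be sketched in Python as 
below.  The function name is made up for illustration, and this is not Storm's 
actual path-handling code.

```python
import os


def resolve_against_storm_home(configured_dir, storm_home):
    """Keep absolute paths as they are; anchor relative ones at storm_home."""
    if os.path.isabs(configured_dir):
        return os.path.normpath(configured_dir)
    return os.path.normpath(os.path.join(storm_home, configured_dir))
```

    With a hypothetical `storm_home` of `/opt/storm`, a configured value of 
`healthcheck` resolves under `/opt/storm`, while an absolute value such as 
`/etc/storm/healthcheck` is returned unchanged.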


> Supervisor recurring health checks
> ----------------------------------
>
>                 Key: STORM-1155
>                 URL: https://issues.apache.org/jira/browse/STORM-1155
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>
> Add the ability for the supervisor to call out to health check scripts to 
> allow some validation of the health of the node the supervisor is running on.
> It could regularly run scripts in a directory provided by the cluster admin. 
> If any scripts fail, it should kill the workers and stop itself.
> This could work very much like the Hadoop scripts and if ERROR is returned on 
> stdout it means the node has some issue and we should shut down.
> If a non-zero exit code is returned it indicates that the scripts failed to 
> execute properly so you don't want to mark the node as unhealthy.
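
The behavior described above can be sketched roughly as follows, in Python. 
This is not the code from the pull request.  It makes several assumptions: 
the function name and the timeout value are invented for illustration, 
"ERROR is returned on stdout" is read as "some output line starts with ERROR", 
and a script that times out or cannot be started is treated like one that 
exits non-zero, meaning it failed to run and says nothing about the node.

```python
import os
import subprocess
from pathlib import Path


def node_is_healthy(script_dir, timeout_secs=5.0):
    """Run each executable file in script_dir; False if any reports ERROR."""
    for script in sorted(Path(script_dir).iterdir()):
        if not script.is_file() or not os.access(script, os.X_OK):
            continue
        try:
            result = subprocess.run(
                [str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_secs,
            )
        except (OSError, subprocess.TimeoutExpired):
            # Assumption: a script that could not run or timed out is a
            # broken check, not evidence that the node is unhealthy.
            continue
        if result.returncode != 0:
            # Per the description: the script itself failed, so its output
            # is ignored and the node is not marked unhealthy.
            continue
        if any(line.startswith("ERROR") for line in result.stdout.splitlines()):
            return False
    return True
```

A supervisor-side loop would call something like this on a fixed interval and, 
on a `False` result, kill the workers and stop itself, as the description 
proposes.  That part is left out of the sketch.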



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
