[ 
https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718766#action_12718766
 ] 

Hemanth Yamijala commented on HADOOP-5478:
------------------------------------------

bq. Steve's comments seem to imply that we need the capability to launch the 
health checking script in response to external commands. 

Hong, while this is true, we can make it work as long as we use an RPC to 
communicate with the health checking process. One command in the RPC is 
certainly the heartbeat, which both reports the status and lets the TT 
know that the health checker is running. The other could be a 'run script now' 
command. This can be added as an extension on top of the basic framework, for 
example when the ping is added to the TT. Would that work?
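
To make the two-command idea concrete, here is a rough sketch in Python. It is 
not taken from the attached patches: the names HealthChecker, HeartbeatMonitor, 
heartbeat and run_script_now are hypothetical, and plain method calls stand in 
for the actual RPC layer. It only illustrates how a heartbeat can carry the 
status and prove liveness, with 'run script now' as a separate command.

```python
import time
from dataclasses import dataclass


@dataclass
class HealthStatus:
    healthy: bool
    report: str
    checked_at: float


class HealthChecker:
    """Hypothetical health-checking process exposing two RPC-style commands."""

    def __init__(self, run_script):
        # run_script is an assumed callable returning (healthy, report);
        # it stands in for launching the real node health check script.
        self._run_script = run_script
        self._last = None

    def run_script_now(self):
        # On-demand command: run the script immediately and cache the result.
        healthy, report = self._run_script()
        self._last = HealthStatus(healthy, report, time.time())
        return self._last

    def heartbeat(self):
        # Heartbeat command: return the most recent status. The script is
        # only run here if no result has been cached yet.
        if self._last is None:
            return self.run_script_now()
        return self._last


class HeartbeatMonitor:
    """Hypothetical TT-side bookkeeping for the checker's heartbeats."""

    def __init__(self, timeout_secs, clock=time.monotonic):
        self._timeout = timeout_secs
        self._clock = clock
        self._last_seen = None
        self.last_status = None

    def on_heartbeat(self, status):
        # Record both the reported status and the time it arrived.
        self._last_seen = self._clock()
        self.last_status = status

    def checker_alive(self):
        # The checker counts as running only if a heartbeat arrived recently.
        if self._last_seen is None:
            return False
        return (self._clock() - self._last_seen) <= self._timeout
```

In this sketch the TT would call on_heartbeat() for every heartbeat it receives 
and treat a False result from checker_alive() as a sign that the health checker 
itself has died, which is a different condition from an unhealthy node.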

bq. Can't the status reporting be simply done via stdout/stderr with START and 
END markers?

I suppose so. In the past we've seen a couple of issues when we were not 
careful in interacting with the I/O streams of subprocesses. Many of these may 
already be fixed in Hadoop, and so may not really be an issue. But RPC seems 
simpler and cleaner. I would be glad to hear what others think as well.
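
For comparison, the stdout alternative might look roughly like the Python 
sketch below. The marker format is an assumption (START and END each on a line 
of their own), and extract_report and run_health_script are hypothetical names, 
not anything in Hadoop. subprocess.run with capture_output=True reads stdout 
and stderr together, which avoids the deadlock that can occur when one pipe 
fills up while the parent is blocked reading the other.

```python
import subprocess
import sys


def extract_report(output, start="START", end="END"):
    """Return the text between the marker lines, or None if either is missing."""
    lines = output.splitlines()
    try:
        begin = lines.index(start)
        finish = lines.index(end, begin + 1)
    except ValueError:
        # No START, or no END after START: treat the output as malformed.
        return None
    return "\n".join(lines[begin + 1:finish])


def run_health_script(argv, timeout_secs=30):
    """Run the script and return (exit_code, report between the markers)."""
    # Raises subprocess.TimeoutExpired if the script hangs past timeout_secs.
    proc = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout_secs
    )
    return proc.returncode, extract_report(proc.stdout)


if __name__ == "__main__":
    # Stand-in for a real health check script, for demonstration only.
    demo = [sys.executable, "-c", "print('START'); print('ok'); print('END')"]
    print(run_health_script(demo))
```

Anything the script prints outside the markers is ignored, and a missing marker 
yields None, so the caller has to decide whether malformed output should count 
as an unhealthy node.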

> Provide a node health check script and run it periodically to check the node 
> health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It 
> should run the health check script periodically and, if there are any errors, 
> it should blacklist the node. This will be really helpful when we run static 
> mapred clusters. Otherwise we may have to run some scripts/daemons periodically 
> to find the node status and take it offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
