[ 
https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708535#action_12708535
 ] 

Allen Wittenauer commented on HADOOP-5478:
------------------------------------------



Torque specifically looks for a line that begins with ERROR on stdout, but
reports the whole line in the node status.  So running pbsnodes shows the
full status message, which provides an easy way to audit all nodes on a given
Torque server.  We really need the equivalent of dfsadmin -report for the
JobTracker to provide this same level of output.
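
As a rough illustration of the rule described above (not Torque's actual
code), the check might be sketched like this in Python.  The function name
find_error_line is made up for the example, and treating "begins with" as a
strict prefix match with no leading whitespace is an assumption.

```python
from typing import Optional


def find_error_line(stdout: str) -> Optional[str]:
    """Return the first stdout line that begins with ERROR, or None.

    The whole line is returned, not just the ERROR token, so that it can
    be shown verbatim as the node's status message.
    """
    for line in stdout.splitlines():
        # Assumption: a strict prefix match, with no leading whitespace allowed.
        if line.startswith("ERROR"):
            return line
    return None


# Hypothetical health script output, for illustration only.
sample = "checking disks\nERROR /dev/sdb is read-only\n"
print(find_error_line(sample))
```

Returning the full line rather than a boolean is what lets a report tool
display the reason next to each node.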

Additionally, Torque ignores the exit status.  In the vast majority of cases,
the node is going to be good.  So the approach they take is that if a script
has a syntax error (and would therefore exit with a failure code), the
node should be considered good anyway.
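
A minimal sketch of that policy, assuming the checker simply runs a command
and inspects its stdout.  The name node_is_healthy and the two stand-in
"scripts" are hypothetical; this shows the behaviour as described here, not
how Torque implements it.

```python
import subprocess
import sys
from typing import Sequence


def node_is_healthy(command: Sequence[str]) -> bool:
    """Run a health script; only an ERROR line on stdout marks the node bad."""
    result = subprocess.run(command, capture_output=True, text=True)
    # result.returncode is deliberately not consulted: a script that
    # crashes or has a syntax error still leaves the node marked good.
    return not any(
        line.startswith("ERROR") for line in result.stdout.splitlines()
    )


# Hypothetical stand-ins for health scripts, run with the current interpreter.
broken_script = [sys.executable, "-c", "import sys; sys.exit(1)"]
failing_check = [sys.executable, "-c", "print('ERROR disk full')"]

print(node_is_healthy(broken_script))
print(node_is_healthy(failing_check))
```

The trade-off is that a broken script never takes a node out of service by
accident, at the cost of silently hiding the fact that the check did not run.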

> Provide a node health check script and run it periodically to check the node 
> health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>
> Hadoop must have some mechanism to find the health status of a node. It
> should run the health check script periodically and, if there are any
> errors, it should blacklist the node. This will be really helpful when we
> run static mapred clusters. Otherwise we may have to run some
> scripts/daemons periodically to find the node status and take it offline
> manually.

