[ 
https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720514#action_12720514
 ] 

Hemanth Yamijala commented on HADOOP-5478:
------------------------------------------

bq. The problem is that if there is something wrong that prevents the health 
checker from communicating to TT, health checker would quit voluntarily without 
TT's knowledge

That does sound like an issue. One simple solution might be to include, in the 
TaskTrackerStatus report, a timestamp of when the health checker last ran. I 
am of course borrowing the idea from the information we already keep about when 
the last heartbeat was received from a TT. We could use that timestamp to find 
trackers that haven't updated their health status for longer than a certain 
interval. Would that work?
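
A rough sketch of the timestamp idea, in Java, might look like the following. 
Everything here is hypothetical and not taken from any of the attached patches: 
the class name StaleHealthDetector, the method names, and the threshold are 
assumptions for illustration, and the real TaskTrackerStatus is represented 
only by a tracker name plus the reported timestamp.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: tracks, per tracker, the time its health checker last
// ran, as reported alongside the (assumed) status report.
class StaleHealthDetector {
    // Maximum tolerated age of a health report, in milliseconds (assumed knob).
    private final long maxHealthReportAgeMs;

    // Tracker name -> timestamp (ms) of the last health check it reported.
    // TreeMap keeps the output order deterministic.
    private final Map<String, Long> lastHealthCheckMs = new TreeMap<>();

    StaleHealthDetector(long maxHealthReportAgeMs) {
        this.maxHealthReportAgeMs = maxHealthReportAgeMs;
    }

    // Called whenever a status report arrives from a tracker.
    void recordStatus(String trackerName, long reportedHealthCheckMs) {
        lastHealthCheckMs.put(trackerName, reportedHealthCheckMs);
    }

    // Returns the trackers whose last reported health check is older than
    // the allowed interval, measured against the supplied current time.
    List<String> findStaleTrackers(long nowMs) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHealthCheckMs.entrySet()) {
            if (nowMs - e.getValue() > maxHealthReportAgeMs) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }
}
```

With this shape, a health checker that quits silently simply stops advancing 
its timestamp, so the tracker eventually shows up in findStaleTrackers() even 
though its regular heartbeats keep arriving.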

> Provide a node health check script and run it periodically to check the node 
> health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Sreekanth Ramakrishnan
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, 
> hadoop-5478-3.patch, hadoop-5478-4.patch, hadoop-5478-5.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It 
> should run the health check script periodically and, if there are any errors, 
> it should blacklist the node. This will be really helpful when we run static 
> mapred clusters. Otherwise we may have to run scripts/daemons periodically to 
> find the node status and take the node offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
