[ 
https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721042#action_12721042
 ] 

Hemanth Yamijala commented on HADOOP-5478:
------------------------------------------

Folks, I still maintain that the focus of this jira is just checking the health 
of the node as determined by an administrator-supplied script. The last few 
comments are focusing more on the health of the TT. For the purpose of making 
incremental progress, let us stick to the original scope and defer discussions 
of checking the health of the TT, and corrective actions thereof, to a 
separate jira.

So, to summarize, the health checker kills itself if it cannot communicate with 
the TT (similar to the child JVM). If this happens because the TT is down, well 
and good. The 'lost tasktracker' logic of the jobtracker would ensure this 
status is captured. If this happens because the TT was overwhelmed, well, maybe 
the TT is not 'healthy' any more. But the fact that we are reporting timestamps 
of the last health status gives the administrators an opportunity to know that 
something is amiss on this node, because its health has not been updated for a 
while. Either way we can alert ourselves to problems. So, the purpose is still 
solved. Of course, there are better, more automated ways to do it. That would 
qualify for a next increment.
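
To make the summary above concrete, here is a minimal sketch of the two pieces 
involved: a checker loop that stops as soon as it cannot reach the TT, and a 
staleness test on the timestamp of the last reported status. Everything in it 
(the class NodeHealthSketch, HealthReport, runChecker, isStale, and the 
interval and age values) is a hypothetical name chosen for illustration; none 
of it is taken from the attached patches, and the TT ping and the admin script 
are stubbed out as simple callbacks.

```java
import java.util.function.BooleanSupplier;

/** Illustrative sketch only; these names are assumptions, not the patch's API. */
final class NodeHealthSketch {

    /** Last status delivered to the TT, with the time it was reported. */
    static final class HealthReport {
        final boolean healthy;
        final long reportedAtMillis;

        HealthReport(boolean healthy, long reportedAtMillis) {
            this.healthy = healthy;
            this.reportedAtMillis = reportedAtMillis;
        }
    }

    /**
     * Simulates the checker loop with a virtual clock. Each tick first pings
     * the TT; if the ping fails the loop ends (a real checker process would
     * exit here), otherwise the script result is recorded with a timestamp.
     * Returns the last report delivered, or null if none was delivered.
     */
    static HealthReport runChecker(BooleanSupplier ttReachable,
                                   BooleanSupplier nodeHealthy,
                                   long startMillis,
                                   long intervalMillis,
                                   int maxTicks) {
        HealthReport last = null;
        long now = startMillis;
        for (int i = 0; i < maxTicks; i++) {
            if (!ttReachable.getAsBoolean()) {
                break; // cannot talk to the TT: stop reporting
            }
            last = new HealthReport(nodeHealthy.getAsBoolean(), now);
            now += intervalMillis;
        }
        return last;
    }

    /** A report is stale if it is missing or older than maxAgeMillis. */
    static boolean isStale(HealthReport last, long nowMillis, long maxAgeMillis) {
        return last == null || nowMillis - last.reportedAtMillis > maxAgeMillis;
    }

    public static void main(String[] args) {
        // The TT answers the first three pings, then stops responding.
        int[] pings = {0};
        HealthReport last =
            runChecker(() -> pings[0]++ < 3, () -> true, 0L, 1000L, 10);
        System.out.println("last report at " + last.reportedAtMillis
            + " ms, healthy=" + last.healthy);
        System.out.println("stale after 60s: " + isStale(last, 60_000L, 5_000L));
    }
}
```

The point of the sketch is that the last report can still say "healthy" while 
its timestamp is old, so it is the age of the report, not its content, that 
tells the administrator something is amiss.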

Hope this makes sense.

> Provide a node health check script and run it periodically to check the node 
> health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Sreekanth Ramakrishnan
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, 
> hadoop-5478-3.patch, hadoop-5478-4.patch, hadoop-5478-5.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It 
> should run the health check script periodically and, if there are any errors, 
> it should blacklist the node. This will be really helpful when we run static 
> mapred clusters. Otherwise we may have to run some scripts/daemons 
> periodically to find the node status and take it offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
