[ https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721042#action_12721042 ]
Hemanth Yamijala commented on HADOOP-5478:
------------------------------------------

Folks, I still maintain that the focus of this jira is just checking the health of the node as determined by an administrator-supplied script. The last few comments are focusing more on the health of a TT. For the purpose of making incremental progress, let us stick to the original scope and defer discussions of checking the health of the TT, and corrective actions thereof, to a separate jira.

So, to summarize, the health checker kills itself if it cannot communicate with the TT (similar to the child JVM). If this happens because the TT is down, well and good. The 'lost tasktracker' logic of the jobtracker would ensure this status is captured. If this happens because the TT was overwhelmed, well, maybe the TT is not 'healthy' any more. But the fact that we are reporting timestamps of the last health status gives the administrators an opportunity to know that something is amiss on this node, because its health has not been updated for a while. Either way we can alert ourselves to problems. So, the purpose is still served.

Of course, there are better, more automated ways to do it. That would qualify for a next increment. Hope this makes sense.

> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Sreekanth Ramakrishnan
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, hadoop-5478-3.patch, hadoop-5478-4.patch, hadoop-5478-5.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It should run the health check script periodically and if there are any errors, it should blacklist the node. This will be really helpful when we run static mapred clusters. Else we may have to run some scripts/daemons periodically to find the node status and take it offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
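
For illustration only, the timestamp argument in the comment above could be sketched roughly as follows. This is not code from any of the attached patches. The record type, function names, and threshold are all hypothetical, and the sketch assumes each node reports a health flag together with the time of its last successful update. A node whose update is too old is flagged as stale regardless of the flag it last reported, which is how an administrator would notice a checker that has killed itself or a TT that is overwhelmed.

```python
from dataclasses import dataclass


@dataclass
class NodeHealthReport:
    """Hypothetical record of what a node last reported; not a Hadoop type."""
    node: str
    healthy: bool
    last_update_ms: int  # time of the last successful health report


def classify(report, now_ms, stale_after_ms):
    """Return 'stale', 'unhealthy', or 'healthy' for one node."""
    # A stale timestamp wins over the reported flag: the flag may be outdated.
    if now_ms - report.last_update_ms > stale_after_ms:
        return "stale"
    return "healthy" if report.healthy else "unhealthy"


def nodes_needing_attention(reports, now_ms, stale_after_ms):
    """List (node, status) pairs for nodes that are not plainly healthy."""
    flagged = []
    for report in reports:
        status = classify(report, now_ms, stale_after_ms)
        if status != "healthy":
            flagged.append((report.node, status))
    return flagged
```

The staleness threshold would presumably be chosen as some multiple of the health check interval, so that a single delayed report does not raise an alert.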