[ https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719991#action_12719991 ]
Hemanth Yamijala commented on HADOOP-5478:
------------------------------------------

To add to Sreekanth's comments:

- We are using a new port number for the TT to bind to, on which the health checker script sends its updates. The other option was to reuse the port used for the TaskUmbilicalProtocol. We thought the health service should not mix with child tasks reporting status, and hence kept it separate.
- The other important point is how the health checker stops. Currently, the model is similar to how a child stops: if it can't report status to the TT, it kills itself. This is required anyway, because it has to handle the case of the TT dying unexpectedly. However, this is the extreme case. When the TT is stopped normally, there are better options for stopping the health check script. For example, we could add a shutdown hook to the TT and send a signal to the health checker. We could also make the health checker a separate daemon so that stop-mapred could stop it. Any of these options can easily be implemented as a follow-up once the basic structure is in place.

Please let us know if these points make sense.

> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, hadoop-5478-3.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It
> should run the health check script periodically and, if there are any errors,
> it should blacklist the node. This will be really helpful when we run static
> mapred clusters. Otherwise we may have to run some scripts/daemons periodically
> to find the node status and take it offline manually.
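As a rough illustration of the stop model described in the comment above, here is a minimal Java sketch. It is not taken from the attached patches. The names HealthReporterSketch, StatusSink, reportUntilUnreachable and registerStopHook are hypothetical, and the "TT" is simulated by a lambda rather than a real RPC endpoint. It shows the two ideas side by side: the checker gives up once it can no longer report status, and the TT side can register a JVM shutdown hook that stops the checker on a normal shutdown.

```java
import java.io.IOException;
import java.util.function.Supplier;

class HealthReporterSketch {

    /** Hypothetical stand-in for whatever RPC the health checker uses to reach the TT. */
    public interface StatusSink {
        void report(String status) throws IOException;
    }

    /**
     * Reports status until maxConsecutiveFailures reports in a row have failed,
     * then returns the number of reports that were delivered. A real checker
     * process would exit at that point instead of returning.
     */
    public static int reportUntilUnreachable(StatusSink sink, Supplier<String> check,
                                             int maxConsecutiveFailures, long intervalMs) {
        int delivered = 0;
        int failures = 0;
        while (failures < maxConsecutiveFailures) {
            try {
                sink.report(check.get());
                delivered++;
                failures = 0;              // a successful report resets the counter
            } catch (IOException e) {
                failures++;                // TT may be gone; count and retry
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;                     // treat interruption as a request to stop
            }
        }
        return delivered;
    }

    /** TT side (assumed design): on normal JVM shutdown, stop the checker child process. */
    public static Thread registerStopHook(Process healthChecker) {
        Thread hook = new Thread(healthChecker::destroy, "stop-health-checker");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        // Simulated TT that accepts three reports and then becomes unreachable.
        int[] remaining = {3};
        StatusSink fakeTT = status -> {
            if (remaining[0]-- <= 0) {
                throw new IOException("TT unreachable");
            }
        };
        int delivered = reportUntilUnreachable(fakeTT, () -> "OK", 2, 10L);
        System.out.println("delivered=" + delivered);
    }
}
```

The self-termination path is needed regardless, since a shutdown hook does not run if the TT JVM is killed abruptly. The hook only covers the normal-stop case.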