[ https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718749#action_12718749 ]
Hemanth Yamijala commented on HADOOP-5478:
------------------------------------------

To summarize some of the comments above, the issue we are discussing is whether the node health checker script should be run from a process separate from the tasktracker (TT), rather than from a thread inside the TT, as the patch currently does. There are some motivations for doing so:

- Periodically launching a process from a Java service like the TT has caused problems in the past. For example, see HADOOP-5059.
- Owen also mentioned instances where they'd seen the service itself lock up because the process launch (and the underlying fork()/exec()) failed.

So, the proposal is to solve this problem by running the node health checker as a separate process. This process can be configured with the following:

- Path to a script
- An interval
- TT's address for communication

The process would periodically run the script (as done in the patch today) and report the status to the TT using RPC.

To keep management simple, we can, in the first cut, launch this process from the TT itself and stop it when the TT is going down. In the future, it should be possible to decouple this even more and have them run independently. The simplicity we buy in the first iteration is that administrators need not worry about managing this process independently for the time being, until we gain some experience with how the health check script is running.

Does this sound fine?
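A minimal sketch of what such a standalone monitor process could look like follows. It is illustrative only and is not taken from the attached patches: the class name NodeHealthMonitor, the StatusReporter interface (a stand-in for the RPC call to the TT), the timeout parameter, and the rule that exit code 0 means "healthy" are all assumptions made for this example.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical standalone health monitor; not the code from the HADOOP-5478 patches. */
public class NodeHealthMonitor {

    /** Stand-in for the RPC to the TT; a real version would call the TT's address. */
    public interface StatusReporter {
        void report(boolean healthy, String detail);
    }

    private final List<String> command;   // script path plus any arguments
    private final long intervalMs;        // delay between two runs of the script
    private final long timeoutMs;         // assumed safeguard against a hung script
    private final StatusReporter reporter;

    public NodeHealthMonitor(List<String> command, long intervalMs,
                             long timeoutMs, StatusReporter reporter) {
        this.command = command;
        this.intervalMs = intervalMs;
        this.timeoutMs = timeoutMs;
        this.reporter = reporter;
    }

    /** Runs the script once and reports the result. Assumes exit code 0 means healthy. */
    public boolean checkOnce() {
        boolean healthy;
        String detail;
        try {
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!p.waitFor(timeoutMs, TimeUnit.MILLISECONDS)) {
                p.destroyForcibly();
                healthy = false;
                detail = "timed out";
            } else {
                healthy = p.exitValue() == 0;
                detail = "exit code " + p.exitValue();
            }
        } catch (IOException e) {
            // A failed launch is reported as unhealthy instead of stalling the monitor.
            healthy = false;
            detail = "launch failed: " + e.getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            healthy = false;
            detail = "interrupted";
        }
        reporter.report(healthy, detail);
        return healthy;
    }

    /** Schedules checkOnce() with a fixed delay; the caller shuts the executor down. */
    public ScheduledExecutorService start() {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        ses.scheduleWithFixedDelay(this::checkOnce, 0, intervalMs, TimeUnit.MILLISECONDS);
        return ses;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: NodeHealthMonitor <script-path> <interval-ms>");
            return;
        }
        // Printing stands in for the RPC to the TT in this sketch.
        StatusReporter printer = (healthy, detail) ->
                System.out.println((healthy ? "HEALTHY" : "UNHEALTHY") + " (" + detail + ")");
        new NodeHealthMonitor(List.of(args[0]), Long.parseLong(args[1]), 10_000L, printer).start();
    }
}
```

Because the script is launched from this separate JVM, a fork()/exec() failure or a hung script affects only the monitor, and the TT sees at worst a missing or unhealthy report. In the first cut described above, the TT would start this process itself and stop it on shutdown.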

> Provide a node health check script and run it periodically to check the node health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It
> should run the health check script periodically and, if there are any errors,
> it should blacklist the node. This will be really helpful when we run static
> mapred clusters. Otherwise we may have to run some scripts/daemons periodically
> to find the node status and take it offline manually.