[ https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720122#action_12720122 ]
Hemanth Yamijala commented on HADOOP-5478: ------------------------------------------ This is a reasonably complex patch, and in the current state, it has already come a long way. I do have the following comments: Comments on the spec: - It may be useful to allow arguments to be provided to the health check script. Shell: - It seems like it will be easier to use a Timer to do the interrupting rather than implementing the logic ourselves. - The timeout thread seems to be being interrupted and joined twice. - Do we need the setTimeoutInterval ? What happens if it is set after the command has started. I think it is better to have it set only once. Configuration: - Default value for mapred.healthChecker.interval can be a little larger, like 60000. - Let's call mapred.healthChecker.failure_interval mapred.healthChecker.script.timeout. Likewise the variable in NodeHealthChecker failureintervalTime as scriptTimeout or some such name. NodeHealthCheckerService: - Instead of setting the health status in the TaskTracker, I think the variable can simply be stored as a local variable by NodeHealthCheckerService, and TT can call it using an appropriately synchronized accessor method. - If the thread launching the health checker VM is a daemon, I don't think there's a need to join it in close. - Does NodeHealthCheckerService need to be a public class ? - Please give a name to the thread launching the health checker VM NodeHealthChecker: - Typo: HEALTH_CHECK_INTERVAL_PREOPERTY should be HEALTH_CHECK_INTERVAL_PROPERTY - If there is an exception in executing the script - should we treat this as a failure ? As a follow-up, should we ignore the output in such a case, as it may not be meaningful to rely upon ? - A related question is what to do if the output is null. Currently we don't seem to be reporting anything at all. This clearly seems like a bug. I think if it did not timeout, this should be treated as a success. - Also, should we be more careful in where we check for ERROR. I think the output should begin with the string ERROR, rather than ERROR appearing anywhere in the string. - I think we should check whether the script timed out before we check the output. It reads more logically that way. - I would also suggest we move the code in the finally block to a separate method for making it more obvious. Among other things it would help us unit test this very important logic easily. - I think the check for the health check script should be done in NodeHealthCheckerService. This will avoid us the cost of launching the VM and then failing. However, basic checks for the script can still be there in NodeHealthChecker so it can be used as an independent unit. - Where do the log messages of the NodeHealthChecker go ? I think it should go to a separate log file. Give a name to the Timer object created for running the health check script. JobTracker: - unBlacklistTracker may not blacklist a tracker if the reason for blacklisting was different. Shouldn't this be checked whenever a call is made, and the subsequent actions be done only if unBlacklistTracker returns a success. - When tasktracker status is updated because the node becomes healthy, shouldn't isBlackListed be set to true ? TaskTracker: - shutdown() seems to have got an extraneous log message by mistake. Test cases: - In the testcase, can we also verify the healthReport value for unhealthy trackers ? - I think we also need a testcase that combines task failures with health script failing and make sure we don't unblacklist a tracker if it was blacklisted due to another reason. Documentation: - New classes (like NodeHealthCheckerService) need Javadoc. Indeed, please document all new methods. - We also need Forrest documentation on this super cool feature. *smile* > Provide a node health check script and run it periodically to check the node > health status > ------------------------------------------------------------------------------------------ > > Key: HADOOP-5478 > URL: https://issues.apache.org/jira/browse/HADOOP-5478 > Project: Hadoop Core > Issue Type: New Feature > Components: mapred > Affects Versions: 0.20.0 > Reporter: Aroop Maliakkal > Assignee: Vinod K V > Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, > hadoop-5478-3.patch, hadoop-5478-4.patch, hadoop-5478-5.patch > > > Hadoop must have some mechanism to find the health status of a node . It > should run the health check script periodically and if there is any errors, > it should black list the node. This will be really helpful when we run static > mapred clusters. Else we may have to run some scripts/daemons periodically to > find the node status and take it offline manually. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.