[ 
https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720122#action_12720122
 ] 

Hemanth Yamijala commented on HADOOP-5478:
------------------------------------------

This is a reasonably complex patch, and in the current state, it has already 
come a long way. I do have the following comments:

Comments on the spec:

- It may be useful to allow arguments to be provided to the health check script.

Shell:

- It seems like it will be easier to use a Timer to do the interrupting rather 
than implementing the logic ourselves.
- The timeout thread seems to be being interrupted and joined twice.
- Do we need the setTimeoutInterval ? What happens if it is set after the 
command has started. I think it is better to have it set only once.

Configuration:

- Default value for mapred.healthChecker.interval can be a little larger, like 
60000.
- Let's call mapred.healthChecker.failure_interval 
mapred.healthChecker.script.timeout. Likewise the variable in NodeHealthChecker 
failureintervalTime as scriptTimeout or some such name.

NodeHealthCheckerService:

- Instead of setting the health status in the TaskTracker, I think the variable 
can simply be stored as a local variable by NodeHealthCheckerService, and TT 
can call it using an appropriately synchronized accessor method.
- If the thread launching the health checker VM is a daemon, I don't think 
there's a need to join it in close.
- Does NodeHealthCheckerService need to be a public class ?
- Please give a name to the thread launching the health checker VM

NodeHealthChecker:

- Typo: HEALTH_CHECK_INTERVAL_PREOPERTY should be HEALTH_CHECK_INTERVAL_PROPERTY
- If there is an exception in executing the script - should we treat this as a 
failure ? As a follow-up, should we ignore the output in such a case, as it may 
not be meaningful to rely upon ?
- A related question is what to do if the output is null. Currently we don't 
seem to be reporting anything at all. This clearly seems like a bug. I think if 
it did not timeout, this should be treated as a success.
- Also, should we be more careful in where we check for ERROR. I think the 
output should begin with the string ERROR, rather than ERROR appearing anywhere 
in the string.
- I think we should check whether the script timed out before we check the 
output. It reads more logically that way.
- I would also suggest we move the code in the finally block to a separate 
method for making it more obvious. Among other things it would help us unit 
test this very important logic easily.
- I think the check for the health check script should be done in 
NodeHealthCheckerService. This will avoid us the cost of launching the VM and 
then failing. However, basic checks for the script can still be there in 
NodeHealthChecker so it can be used as an independent unit.
- Where do the log messages of the NodeHealthChecker go ? I think it should go 
to a separate log file.
Give a name to the Timer object created for running the health check script.

JobTracker:

- unBlacklistTracker may not blacklist a tracker if the reason for blacklisting 
was different. Shouldn't this be checked whenever a call is made, and the 
subsequent actions be done only if unBlacklistTracker returns a success.
- When tasktracker status is updated because the node becomes healthy, 
shouldn't isBlackListed be set to true ?

TaskTracker:
- shutdown() seems to have got an extraneous log message by mistake.

Test cases:
- In the testcase, can we also verify the healthReport value for unhealthy 
trackers ?
- I think we also need a testcase that combines task failures with health 
script failing and make sure we don't unblacklist a tracker if it was 
blacklisted due to another reason.

Documentation:
- New classes (like NodeHealthCheckerService) need Javadoc. Indeed, please 
document all new methods.
- We also need Forrest documentation on this super cool feature. *smile* 


> Provide a node health check script and run it periodically to check the node 
> health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, 
> hadoop-5478-3.patch, hadoop-5478-4.patch, hadoop-5478-5.patch
>
>
> Hadoop must have some mechanism to find the health status of a node . It 
> should run the health check script periodically and if there is any errors, 
> it should black list the node. This will be really helpful when we run static 
> mapred clusters. Else we may have to run some scripts/daemons periodically to 
> find the node status and take it offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to