[
https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720122#action_12720122
]
Hemanth Yamijala commented on HADOOP-5478:
------------------------------------------
This is a reasonably complex patch, and in the current state, it has already
come a long way. I do have the following comments:
Comments on the spec:
- It may be useful to allow arguments to be provided to the health check script.
Shell:
- It seems like it will be easier to use a Timer to do the interrupting rather
than implementing the logic ourselves.
- The timeout thread seems to be being interrupted and joined twice.
- Do we need the setTimeoutInterval ? What happens if it is set after the
command has started. I think it is better to have it set only once.
Configuration:
- Default value for mapred.healthChecker.interval can be a little larger, like
60000.
- Let's call mapred.healthChecker.failure_interval
mapred.healthChecker.script.timeout. Likewise the variable in NodeHealthChecker
failureintervalTime as scriptTimeout or some such name.
NodeHealthCheckerService:
- Instead of setting the health status in the TaskTracker, I think the variable
can simply be stored as a local variable by NodeHealthCheckerService, and TT
can call it using an appropriately synchronized accessor method.
- If the thread launching the health checker VM is a daemon, I don't think
there's a need to join it in close.
- Does NodeHealthCheckerService need to be a public class ?
- Please give a name to the thread launching the health checker VM
NodeHealthChecker:
- Typo: HEALTH_CHECK_INTERVAL_PREOPERTY should be HEALTH_CHECK_INTERVAL_PROPERTY
- If there is an exception in executing the script - should we treat this as a
failure ? As a follow-up, should we ignore the output in such a case, as it may
not be meaningful to rely upon ?
- A related question is what to do if the output is null. Currently we don't
seem to be reporting anything at all. This clearly seems like a bug. I think if
it did not timeout, this should be treated as a success.
- Also, should we be more careful in where we check for ERROR. I think the
output should begin with the string ERROR, rather than ERROR appearing anywhere
in the string.
- I think we should check whether the script timed out before we check the
output. It reads more logically that way.
- I would also suggest we move the code in the finally block to a separate
method for making it more obvious. Among other things it would help us unit
test this very important logic easily.
- I think the check for the health check script should be done in
NodeHealthCheckerService. This will avoid us the cost of launching the VM and
then failing. However, basic checks for the script can still be there in
NodeHealthChecker so it can be used as an independent unit.
- Where do the log messages of the NodeHealthChecker go ? I think it should go
to a separate log file.
Give a name to the Timer object created for running the health check script.
JobTracker:
- unBlacklistTracker may not blacklist a tracker if the reason for blacklisting
was different. Shouldn't this be checked whenever a call is made, and the
subsequent actions be done only if unBlacklistTracker returns a success.
- When tasktracker status is updated because the node becomes healthy,
shouldn't isBlackListed be set to true ?
TaskTracker:
- shutdown() seems to have got an extraneous log message by mistake.
Test cases:
- In the testcase, can we also verify the healthReport value for unhealthy
trackers ?
- I think we also need a testcase that combines task failures with health
script failing and make sure we don't unblacklist a tracker if it was
blacklisted due to another reason.
Documentation:
- New classes (like NodeHealthCheckerService) need Javadoc. Indeed, please
document all new methods.
- We also need Forrest documentation on this super cool feature. *smile*
> Provide a node health check script and run it periodically to check the node
> health status
> ------------------------------------------------------------------------------------------
>
> Key: HADOOP-5478
> URL: https://issues.apache.org/jira/browse/HADOOP-5478
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Affects Versions: 0.20.0
> Reporter: Aroop Maliakkal
> Assignee: Vinod K V
> Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch,
> hadoop-5478-3.patch, hadoop-5478-4.patch, hadoop-5478-5.patch
>
>
> Hadoop must have some mechanism to find the health status of a node . It
> should run the health check script periodically and if there is any errors,
> it should black list the node. This will be really helpful when we run static
> mapred clusters. Else we may have to run some scripts/daemons periodically to
> find the node status and take it offline manually.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.