[
https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719991#action_12719991
]
Hemanth Yamijala commented on HADOOP-5478:
------------------------------------------
To add to Sreekanth's comments:
- The TT binds to a new port, on which the health checker script sends its
updates. The other option was to reuse the port used for the
TaskUmbilicalProtocol. We thought the health service should not mix with child
tasks reporting status, and hence kept it separate.
- The other important point is how the health checker stops. Currently, the
model is similar to how a child stops: if it can't report status to the TT, it
kills itself. This is required anyway, because it has to handle the case of the
TT dying unexpectedly. However, this is the extreme case. When the TT is
stopped normally, there are better ways to stop the health check script. For
example, we could add a shutdown hook to the TT that sends a signal to the
health checker. We could also make the health checker a separate daemon so
that stop-mapred could stop it. Any of these options can easily be implemented
as a follow-up once the basic structure is in place.
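To make the current stop model concrete, here is a rough sketch of the
"kill itself if it can't report to the TT" loop. It is not the code in the
patch: the function name, its parameters, the retry threshold and the return
values are all hypothetical, and the real transport to the TT is replaced by a
plain callable that raises OSError when the TT cannot be reached.

```python
import time
from typing import Callable, Optional


def run_health_reporter(
    check_health: Callable[[], bool],
    report_to_tt: Callable[[bool], None],
    max_failed_reports: int = 3,       # hypothetical threshold, not from the patch
    interval_secs: float = 1.0,
    max_rounds: Optional[int] = None,  # only here so the sketch can be bounded
) -> str:
    """Report node health until the TT is unreachable; return why the loop ended."""
    failed_reports = 0
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rounds += 1
        healthy = check_health()
        try:
            report_to_tt(healthy)
            failed_reports = 0  # the TT answered, so reset the failure counter
        except OSError:
            # Covers refused or reset connections; the TT may have died.
            failed_reports += 1
            if failed_reports >= max_failed_reports:
                # A real checker would exit the process at this point.
                return "tt_unreachable"
        time.sleep(interval_secs)
    return "max_rounds_reached"


if __name__ == "__main__":
    def unreachable_tt(healthy: bool) -> None:
        # Stand-in for a TT that has died: every report fails.
        raise ConnectionRefusedError("TT is not listening")

    print(run_health_reporter(lambda: True, unreachable_tt, interval_secs=0.0))
```

Counting consecutive failures rather than exiting on the first one is an
assumption made here so that a single dropped report does not stop the
checker. A signal from a TT shutdown hook would simply be a second, faster way
out of the same loop.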
Please let us know if these points make sense.
> Provide a node health check script and run it periodically to check the node
> health status
> ------------------------------------------------------------------------------------------
>
> Key: HADOOP-5478
> URL: https://issues.apache.org/jira/browse/HADOOP-5478
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Affects Versions: 0.20.0
> Reporter: Aroop Maliakkal
> Assignee: Vinod K V
> Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch,
> hadoop-5478-3.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It
> should run the health check script periodically and, if there are any
> errors, it should blacklist the node. This will be really helpful when we
> run static mapred clusters. Otherwise we may have to run some scripts/daemons
> periodically to find the node status and take it offline manually.