[ 
https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719991#action_12719991
 ] 

Hemanth Yamijala commented on HADOOP-5478:
------------------------------------------

To add to Sreekanth's comments:

- The TT binds to a new, separate port on which the health checker script 
sends its updates. The alternative was to reuse the port used for the 
TaskUmbilicalProtocol, but we felt health reports should not be mixed with 
child tasks reporting status, so we kept the two separate.

- The other important point is how the health checker stops. Currently, the 
model mirrors how a child stops: if it cannot report status to the TT, it 
kills itself. This is required in any case, because it has to handle the TT 
dying unexpectedly, but that is the extreme case. When the TT is stopped 
normally, there are better ways to stop the health check script. For example, 
we could add a shutdown hook to the TT that sends a signal to the health 
checker, or we could make the health checker a separate daemon so that 
stop-mapred can stop it. Either option can easily be implemented as a 
follow-up once the basic structure is in place.
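
To make the two stop paths concrete, here is a minimal sketch of the model 
described in the second point. It is not the code from the attached patches: 
the class name HealthReporterSketch, the maxFailures threshold, and the 
reportToTT callback are all hypothetical, and the real checker may well give 
up on the first failed report rather than after several.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the health checker's reporting loop. */
public class HealthReporterSketch {

    /** Why the loop ended. */
    public enum Exit { TT_UNREACHABLE, STOP_REQUESTED }

    // Assumed callback: returns false when a status report to the TT fails.
    private final BooleanSupplier reportToTT;
    // Assumed threshold of consecutive failed reports before giving up.
    private final int maxFailures;
    private volatile boolean stopRequested = false;

    public HealthReporterSketch(BooleanSupplier reportToTT, int maxFailures) {
        this.reportToTT = reportToTT;
        this.maxFailures = maxFailures;
    }

    /** Normal stop path, e.g. triggered by a signal or a shutdown hook. */
    public void requestStop() {
        stopRequested = true;
    }

    /** Reports until the TT looks dead or a stop is requested. */
    public Exit run(long intervalMillis) {
        int failures = 0;
        while (!stopRequested) {
            if (reportToTT.getAsBoolean()) {
                failures = 0;                    // TT reachable, reset the count
            } else if (++failures >= maxFailures) {
                return Exit.TT_UNREACHABLE;      // extreme case: TT has died
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Exit.STOP_REQUESTED;      // treat interruption as a stop
            }
        }
        return Exit.STOP_REQUESTED;
    }
}
```

A caller would exit the process when run() returns TT_UNREACHABLE, which is 
the "kills itself" case. For the normal case, whatever mechanism is chosen 
(a signal sent from a TT shutdown hook, or stop-mapred stopping a separate 
daemon) would end up calling something like requestStop().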

Please let us know if these points make sense.



> Provide a node health check script and run it periodically to check the node 
> health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, 
> hadoop-5478-3.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It 
> should run the health check script periodically and, if there are any 
> errors, it should blacklist the node. This will be really helpful when we 
> run static mapred clusters. Otherwise we may have to run some 
> scripts/daemons periodically to find the node status and take the node 
> offline manually.

