[ 
https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719484#action_12719484
 ] 

Sreekanth Ramakrishnan commented on HADOOP-5478:
------------------------------------------------

Adding a little more to the discussion, the following is the approach I am 
taking to generate a new patch:

* Introduce a new health monitor service which is spawned off by task tracker 
when it starts.
* The service periodically reports the status of the node to the task tracker.
* The protocol is modeled on {{TaskUmbilicalProtocol}}.
* The service receives the task tracker's host address and port as 
command-line arguments when it starts up.
* The service then periodically sends the status update to the task tracker 
at the host and port it was given.
* When the TaskTracker is shut down, the {{NodeHealthChecker}} can no longer 
contact the {{TaskTracker}} and shuts itself down. This is needed because the 
task tracker's {{shutdown()}} or {{close()}} is not called when we run 
{{stop-mapred.sh}}, and the task tracker can also be killed directly with 
{{kill -9 ttpid}}. In either case the TT might not inform the services that 
contact it to report status.
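
The reporting loop described above could be sketched roughly as follows. This 
is only an illustration, not code from the patch: the class name 
{{NodeHealthReporterSketch}}, the {{TrackerLink}} interface, the plain-socket 
transport, the 10 second interval and the always-healthy check are all 
assumptions made for the example. The real service would talk to the task 
tracker over an RPC protocol modeled on the task umbilical protocol.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; names are illustrative and not taken from the patch. */
final class NodeHealthReporterSketch {

    /** Stand-in for the RPC call to the task tracker (assumed, not the real protocol). */
    interface TrackerLink {
        void reportHealth(boolean healthy) throws IOException;
    }

    /**
     * Reports the node health every intervalMillis until the task tracker
     * cannot be reached, then returns the number of reports delivered.
     */
    static int runUntilUnreachable(TrackerLink link, BooleanSupplier check, long intervalMillis) {
        int delivered = 0;
        while (true) {
            try {
                link.reportHealth(check.getAsBoolean());
                delivered++;
                Thread.sleep(intervalMillis);
            } catch (IOException e) {
                // The tracker is gone (for example after kill -9), so stop reporting.
                return delivered;
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop the loop.
                Thread.currentThread().interrupt();
                return delivered;
            }
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: NodeHealthReporterSketch <host> <port>");
            System.exit(1);
        }
        String host = args[0];
        int port = Integer.parseInt(args[1]);

        // Assumed transport: one short-lived TCP connection per report.
        TrackerLink link = healthy -> {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 5000);
                OutputStream out = socket.getOutputStream();
                out.write((healthy ? "HEALTHY\n" : "UNHEALTHY\n").getBytes(StandardCharsets.US_ASCII));
                out.flush();
            }
        };

        // Placeholder check that always reports healthy; a real check would inspect the node.
        int delivered = runUntilUnreachable(link, () -> true, 10_000L);
        System.out.println("Task tracker unreachable after " + delivered + " reports; exiting.");
    }
}
```

Because the loop exits on the first failed report, the service needs no 
explicit shutdown message from the task tracker, which is the point of the 
last item in the list. A real implementation would probably retry a few times 
before giving up.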


> Provide a node health check script and run it periodically to check the node 
> health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It 
> should run the health check script periodically and if there are any errors, 
> it should blacklist the node. This will be really helpful when we run static 
> mapred clusters. Else we may have to run some scripts/daemons periodically to 
> find the node status and take it offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
