[ 
https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720731#action_12720731
 ] 

Steve Loughran commented on HADOOP-5478:
----------------------------------------

> Though, at both times, the only one who knows about the trouble is the health 
> checker and not the rest of the world.

which is why your management tools
# need to be HA toys themselves
# need to be able to ask the apps for their health
# may need to be able to do test jobs to probe system health
# may need the ability to react to failure according to the infrastructure in 
which HDFS is running, and your policy. 

If HDFS is running on anything that supports the EC2 APIs and a TT is playing 
up, I'd start by rebooting that node. If it still doesn't come up, decommission 
the node, terminate the VM and ask for a new one. That's a very different 
policy from a physical cluster, where you may want to blacklist the TT while 
its datanode service stays live. 
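
To make the difference between the two policies concrete, here is a rough sketch of how a management tool might pick its reaction from the infrastructure type. Everything in it is hypothetical and for illustration only: the `Infra` enum, the `recovery_actions` function and the action names are not part of Hadoop or of any patch attached to this issue.

```python
from enum import Enum


class Infra(Enum):
    """Hypothetical infrastructure types a management tool might distinguish."""
    EC2_LIKE = "ec2-like"  # VMs can be rebooted or replaced through an API
    PHYSICAL = "physical"  # fixed hardware; nodes cannot be swapped on demand


def recovery_actions(infra, reboots_tried=0):
    """Return the ordered actions for a node whose TaskTracker is playing up.

    The action names are illustrative labels, not real Hadoop or EC2 commands.
    """
    if infra is Infra.PHYSICAL:
        # Stop scheduling tasks on the node, but keep its datanode serving blocks.
        return ["blacklist_tasktracker", "keep_datanode_live"]
    if reboots_tried == 0:
        # Cheapest fix first on virtual infrastructure: reboot the node.
        return ["reboot_node"]
    # The reboot did not help: retire the node and replace the VM.
    return ["decommission_node", "terminate_vm", "request_new_vm"]
```

The point of keeping the policy in one small function is that the health checker only reports state, while the decision about what to do with a bad node stays with whoever knows the infrastructure.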

> Provide a node health check script and run it periodically to check the node 
> health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Sreekanth Ramakrishnan
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, 
> hadoop-5478-3.patch, hadoop-5478-4.patch, hadoop-5478-5.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It 
> should run the health check script periodically and, if there are any errors, 
> it should blacklist the node. This will be really helpful when we run static 
> mapred clusters. Otherwise we may have to run some scripts/daemons periodically 
> to find the node status and take it offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
