[ https://issues.apache.org/jira/browse/HADOOP-5478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720061#action_12720061 ]

Steve Loughran commented on HADOOP-5478:
----------------------------------------

1. The timeouts in Shell would seem useful on their own; every shell operation 
ought to have timeouts for extra robustness.
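
To make the timeout point concrete, here is a minimal sketch (illustrative
names and the modern java.lang.Process API, not the actual
org.apache.hadoop.util.Shell code) of a shell call that gives up after a
deadline:

import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class TimedShell {
  // Runs the command and returns its exit code, or throws if it has not
  // finished within timeoutMillis.
  public static int run(long timeoutMillis, String... command)
      throws IOException, InterruptedException {
    Process p = new ProcessBuilder(command)
        .redirectErrorStream(true)        // fold stderr into stdout
        .start();
    if (!p.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
      p.destroyForcibly();                // timed out: kill the child
      throw new IOException(
          "command timed out after " + timeoutMillis + "ms");
    }
    return p.exitValue();
  }
}

e.g. run(30000, "df", "-k") fails fast instead of wedging the caller when a
bad NFS mount hangs df.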

2. This would fit fairly well under the HADOOP-3628 stuff, where the monitor 
would start and stop with the TT lifecycle; we'd have to think about how to 
integrate it with the ping operation. I think returning the most recent status 
would be good.
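
Something like this hypothetical shape would do (none of these names come
from HADOOP-3628 itself): the monitor's own thread records each result, and
ping() just hands back the cached status instead of blocking on a fresh check:

public class NodeHealthMonitor {
  private volatile boolean lastHealthy = true;    // most recent check result
  private volatile long lastCheckTime = 0;        // when it was recorded

  // Called from the monitor's own timer thread after each health check.
  void record(boolean healthy) {
    lastHealthy = healthy;
    lastCheckTime = System.currentTimeMillis();
  }

  // Lifecycle ping: cheap and non-blocking, returns the cached status.
  public boolean ping() {
    return lastHealthy;
  }

  // Lets the caller judge how stale the cached status is.
  public long getLastCheckTime() {
    return lastCheckTime;
  }
}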


3. At some point in the future, it would be good for the policy of acting on TT 
failure to be moved out of the JT. In infrastructure where the response to 
failure is to terminate that (virtual) host and ask for a new one, you react 
very differently to failure. It's not something the JT needs to handle, other 
than to pass on the bad news.

4. I'm not sure about all the kill -9 and shutdown hook stuff; it's getting 
into fragile waters. Hard to test, hard to debug, and it creates complex 
situations, especially in test runs or in stuff hosted in different runtimes.

* this helper script stuff must be optional; I would turn it off on my systems, 
as I test health in different ways.
* kill handlers are best designed to do very little and be robust against odd 
system states, and not to assume any other parts of the cluster are live (see 
the sketch after this list).
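
For the second bullet, a sketch of what "very little" means in practice
(hypothetical, not from any patch here):

public class MinimalKillHandler {
  public static void install() {
    Runtime.getRuntime().addShutdownHook(new Thread() {
      @Override
      public void run() {
        // Do almost nothing: no RPC to other nodes, no locks, no cleanup
        // that assumes the rest of the cluster -or even this JVM- is in a
        // sane state. One local log line and get out.
        System.err.println("tasktracker: shutting down");
      }
    });
  }
}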

For the curious, the way SmartFrog manages its health is that every component 
tracks the last time its parent asked it for its health; if that time ever 
exceeds a (programmed) limit, the component terminates itself. Every process 
pings the root component; it's up to that to ping its children and act on 
failures, and to recognise and act on timeouts. This works OK for single-host 
work; in a cluster you don't want any SPOFs and tend to take an aggregate 
view: there has to be one Namenode, one JT, and "enough" workers. I have a 
component to check the health of a file in the filesystem; every time its 
health is checked, it looks for the file it was bound to and checks that it is 
present and within a specified size range. This is handy for checking that 
files you value are there, and that the FS is visible across the network (very 
important on virtual servers with odd networking). I don't have anything similar 
for checking that TTs are good; the best check would be test work.
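
As an illustration, that liveness rule is roughly the following (a
reconstruction of the idea in Java, not SmartFrog's actual code; all names
are made up):

public class LivenessWatchdog {
  private final long limitMillis;                  // the programmed limit
  private volatile long lastPing = System.currentTimeMillis();

  public LivenessWatchdog(long limitMillis) {
    this.limitMillis = limitMillis;
  }

  // The parent calls this whenever it asks the component for its health.
  public void ping() {
    lastPing = System.currentTimeMillis();
  }

  // Run periodically from the component's own timer: if the parent has
  // gone quiet for longer than the limit, terminate this process.
  public void check() {
    if (System.currentTimeMillis() - lastPing > limitMillis) {
      System.err.println("no liveness ping from parent; terminating");
      System.exit(1);
    }
  }
}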

> Provide a node health check script and run it periodically to check the node 
> health status
> ------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5478
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5478
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.20.0
>            Reporter: Aroop Maliakkal
>            Assignee: Vinod K V
>         Attachments: hadoop-5478-1.patch, hadoop-5478-2.patch, 
> hadoop-5478-3.patch, hadoop-5478-4.patch
>
>
> Hadoop must have some mechanism to find the health status of a node. It 
> should run the health check script periodically and, if there are any errors, 
> it should blacklist the node. This will be really helpful when we run static 
> mapred clusters. Otherwise we may have to run some scripts/daemons periodically 
> to find the node status and take it offline manually.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
