[ https://issues.apache.org/jira/browse/HADOOP-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620032#action_12620032 ]

Steve Loughran commented on HADOOP-3893:
----------------------------------------

Thinking about this, the logic to validate the different services should live in 
the services themselves, and needs to be driven off their configuration data. So 
it should really go into the startup routines of the services: after reading 
their state in, they could do more rigorous checks. And they'd run every time 
the service started up, which is what you (normally) want.

One thing we could do is have a health-checker class to aid this, with methods 
to
 -assert that the underlying OS is valid
 -assert that the underlying JVM is supported (warn if not)
 -check that the target directories support locks and are writable by the 
current user
These wouldn't be side-effecting, but would cause a service to fail sooner 
rather than later if the hosting server isn't set up right.
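A minimal sketch of what such a helper could look like. The class and method names here are hypothetical, not existing Hadoop APIs; the writability and lock checks probe the filesystem by actually creating, locking, and deleting a temporary file:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

/** Hypothetical pre-startup health checker; all names are illustrative. */
public class HealthChecker {

  /** Warn (rather than fail) if the JVM version cannot be determined.
   *  A real check would compare against a supported-version list from config. */
  public static boolean checkJvm() {
    String version = System.getProperty("java.version");
    if (version == null) {
      System.err.println("WARN: cannot determine JVM version");
      return false;
    }
    return true;
  }

  /** Check that a target directory exists and is writable by the current
   *  user, by creating and then deleting a probe file in it. */
  public static boolean checkWritableDir(File dir) {
    if (!dir.isDirectory()) {
      System.err.println("Not a directory: " + dir);
      return false;
    }
    try {
      File probe = File.createTempFile("healthcheck", ".tmp", dir);
      return probe.delete();
    } catch (IOException e) {
      System.err.println("Cannot write to " + dir + ": " + e.getMessage());
      return false;
    }
  }

  /** Check that the filesystem under dir supports file locks, by
   *  acquiring an exclusive lock on a probe file. */
  public static boolean checkLocking(File dir) {
    File probe = null;
    try {
      probe = File.createTempFile("lockcheck", ".lock", dir);
      try (RandomAccessFile raf = new RandomAccessFile(probe, "rw");
           FileLock lock = raf.getChannel().tryLock()) {
        return lock != null;
      }
    } catch (IOException e) {
      System.err.println("Locking failed in " + dir + ": " + e.getMessage());
      return false;
    } finally {
      if (probe != null) {
        probe.delete();
      }
    }
  }

  public static void main(String[] args) {
    File tmp = new File(System.getProperty("java.io.tmpdir"));
    System.out.println("jvm=" + checkJvm()
        + " writable=" + checkWritableDir(tmp)
        + " locking=" + checkLocking(tmp));
  }
}
```

Each check reports a boolean and logs a diagnostic instead of throwing, so a service startup routine could run all of them and fail with a complete picture rather than stopping at the first problem.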

> Add hadoop health check/diagnostics to run from command line, JSP pages, 
> other tools
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3893
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3893
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs, mapred
>    Affects Versions: 0.19.0
>            Reporter: Steve Loughran
>            Priority: Minor
>
> If the lifecycle ping() is for short-duration "are we still alive" checks, 
> Hadoop still needs something bigger to check the overall system health. This 
> would be for end users, but also for automated cluster deployment: a complete 
> validation of the cluster.
> It could be a command line tool, and something that runs on different nodes, 
> checked via IPC or JSP. The idea would be to do thorough checks with good 
> diagnostics. Oh, and they should be executable through JUnit too.
> For example
>  -if running on Windows, check that Cygwin is on the path, and fail with a 
> pointer to a wiki issue if not
>  -datanodes should check that they can create locks on the filesystem, create 
> files, and that timestamps are (roughly) aligned with local time
>  -namenodes should try to create files/locks in the filesystem
>  -task trackers should try to exec() something
>  -run through the classpath and look for problems: duplicate JARs, 
> unsupported Java or Xerces versions, etc.
> * The number of tests should be extensible: rather than one single class with 
> all the tests, there'd be something separate for name, task, data, and job 
> tracker nodes
> * They can't be in the nodes themselves, as they should be executable even if 
> the nodes don't come up.
> * Output could be human-readable text or HTML, and a form that could be 
> processed through Hadoop itself in future
> * These tests could have side effects, such as actually trying to submit work 
> to a cluster

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
