[ 
https://issues.apache.org/jira/browse/HDFS-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13026355#comment-13026355
 ] 

Steve Loughran commented on HDFS-1848:
--------------------------------------

+1 for more healthchecking, with easy ways to specify what you want to check 
(presumably a script to exec is the option of choice, or some java class to 
call)

Some standard checks for HDDs (you see them in ant -diagnostics):
 -can you write to a dir
 -can you get back what you wrote
 -is the timestamp of the file roughly in sync with your clock (on network 
drives it may not be)
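
The three checks above could be sketched roughly like this (class and method names here are illustrative, not existing Hadoop APIs):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DirHealthCheck {
    /** Probe a directory: write, read back, and sanity-check the timestamp. */
    public static void check(Path dir, long maxClockSkewMs) throws IOException {
        // 1. Can you write to the dir?
        Path probe = Files.createTempFile(dir, "healthcheck", ".tmp");
        try {
            byte[] payload = "probe".getBytes(StandardCharsets.UTF_8);
            Files.write(probe, payload);

            // 2. Can you get back what you wrote?
            byte[] readBack = Files.readAllBytes(probe);
            if (!Arrays.equals(payload, readBack)) {
                throw new IOException("read-back mismatch in " + dir);
            }

            // 3. Is the file's timestamp roughly in sync with the local clock?
            // (On network drives it may not be.)
            long skew = Math.abs(Files.getLastModifiedTime(probe).toMillis()
                                 - System.currentTimeMillis());
            if (skew > maxClockSkewMs) {
                throw new IOException("timestamp skew of " + skew
                                      + "ms in " + dir);
            }
        } finally {
            Files.deleteIfExists(probe);
        }
    }
}
```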

If you are aggressive you could try creating a large file and seeing what 
happens, though if the health check itself hangs, something else will need to 
detect that and report it as a failure.
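
One way to handle the hung-check case is to run the probe on its own thread with a deadline, so a wedged disk shows up as a failed check rather than blocking the caller forever. A minimal sketch (nothing here is an HDFS API; the names are made up):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedCheck {
    /** Returns true if the check completed in time, false if it hung or threw. */
    public static boolean runWithTimeout(Runnable check, long timeoutMs) {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        Future<?> f = exec.submit(check);
        try {
            f.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            f.cancel(true);    // interrupt the stuck check
            return false;      // treat a hang as a failure
        } catch (Exception e) {
            return false;      // the check itself threw
        } finally {
            exec.shutdownNow();
        }
    }
}
```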

Log drives also cause problems when they aren't there or are full.

> Datanodes should shutdown when a critical volume fails
> ------------------------------------------------------
>
>                 Key: HDFS-1848
>                 URL: https://issues.apache.org/jira/browse/HDFS-1848
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>            Reporter: Eli Collins
>             Fix For: 0.23.0
>
>
> A DN should shutdown when a critical volume (eg the volume that hosts the OS, 
> logs, pid, tmp dir etc.) fails. The admin should be able to specify which 
> volumes are critical, eg they might specify the volume that lives on the boot 
> disk. A failure in one of these volumes would not be subject to the threshold 
> (HDFS-1161) or result in host decommissioning (HDFS-1847) as the 
> decommissioning process would likely fail.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
