[
https://issues.apache.org/jira/browse/HDFS-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022857#comment-13022857
]
Eli Collins commented on HDFS-1848:
-----------------------------------
The DN volume checking code tests all directories specified by dfs.data.dir.
Currently, up to a configurable number of volumes
(dfs.datanode.failed.volumes.tolerated) may fail while the DN stays online.
This jira could have two parts:
# Allow an administrator to designate a subset of dfs.data.dir volumes as
critical, so the DN will fail-stop rather than tolerate the failure of such a
volume. If an admin puts a data dir on the boot disk, they could use this
option to indicate that a failure of that particular dfs.data.dir should not
be tolerated. E.g. dfs.data.dir might be "/data0, /data1, /data2" and
dfs.data.dir.critical could be "/data0". The DN then has three data volumes
but will only tolerate the failure of /data1 and /data2; if /data0 fails, the
DN should fail.
# The DN should in general fail-stop if the root disk fails (e.g. the failure
prevents it from writing a tmp file). This is separate from the dfs.data.dir
volume checking, as the root disk might not be listed as a dfs.data.dir
volume. The mechanism could be the same, though: a mount that is not a
dfs.data.dir could be specified in dfs.data.dir.critical and it would still
be checked via the same mechanism.
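
To make the two parts above concrete, here is a rough sketch of the decision
logic, not the DN's actual volume-checking code. The class and method names
are made up for illustration, and the semantics are an assumption drawn from
the description: any failed critical volume forces a fail-stop, otherwise
failed data dirs are counted against the tolerated threshold. The write probe
stands in for "can the DN still write a tmp file".

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the proposal; none of these names exist in HDFS. */
public class CriticalVolumePolicy {

    /** Probes a directory by creating and then deleting a temp file in it. */
    public static boolean isWritable(Path dir) {
        try {
            Path probe = Files.createTempFile(dir, "probe", ".tmp");
            Files.delete(probe);
            return true;
        } catch (IOException | SecurityException e) {
            return false;
        }
    }

    /** Data dirs plus any critical mounts that are not data dirs (part 2). */
    public static Set<String> volumesToCheck(List<String> dataDirs,
                                             Set<String> critical) {
        Set<String> all = new LinkedHashSet<>(dataDirs);
        all.addAll(critical);
        return all;
    }

    /**
     * Returns true if the DN should fail-stop. A failed critical volume is
     * never tolerated; other failed data dirs count against the threshold.
     */
    public static boolean shouldFailStop(Set<String> failed,
                                         List<String> dataDirs,
                                         Set<String> critical,
                                         int tolerated) {
        for (String volume : failed) {
            if (critical.contains(volume)) {
                return true;
            }
        }
        int failedDataDirs = 0;
        for (String dir : dataDirs) {
            if (failed.contains(dir)) {
                failedDataDirs++;
            }
        }
        return failedDataDirs > tolerated;
    }

    public static void main(String[] args) {
        List<String> dataDirs = Arrays.asList("/data0", "/data1", "/data2");
        Set<String> critical = Collections.singleton("/data0");
        Set<String> bothOthers = new HashSet<>(Arrays.asList("/data1", "/data2"));

        // /data1 and /data2 fail, threshold is 2: DN stays up (prints false).
        System.out.println(shouldFailStop(bothOthers, dataDirs, critical, 2));
        // /data0 alone fails: DN should fail-stop (prints true).
        System.out.println(shouldFailStop(
            Collections.singleton("/data0"), dataDirs, critical, 2));
    }
}
```

A real implementation would presumably run the probe over everything returned
by volumesToCheck() and feed the failures into the decision, so that a
critical mount outside dfs.data.dir is covered by the same check.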
> Datanodes should shutdown when a critical volume fails
> ------------------------------------------------------
>
> Key: HDFS-1848
> URL: https://issues.apache.org/jira/browse/HDFS-1848
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: data-node
> Reporter: Eli Collins
> Fix For: 0.23.0
>
>
> A DN should shut down when a critical volume (e.g. the volume that hosts the
> OS, logs, pid, tmp dir, etc.) fails. The admin should be able to specify which
> volumes are critical, e.g. they might specify the volume that lives on the
> boot disk. A failure in one of these volumes would not be subject to the
> threshold (HDFS-1161) or result in host decommissioning (HDFS-1847), as the
> decommissioning process would likely fail.