[ https://issues.apache.org/jira/browse/HDFS-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022816#comment-13022816 ]
Eli Collins commented on HDFS-1848:
-----------------------------------
bq. I am wondering if this is necessary? Typically, critical volume (eg the
volume that hosts the OS, logs, pid, tmp dir etc.) is RAID-1 and if this goes
down we can safely assume Datanode to be down.
I don't think we should require that datanodes use RAID-1. Mirroring the boot
disk (OS, logs, pids etc) on every datanode costs an extra disk per datanode
in the cluster and requires that datanodes have a HW RAID controller or use SW
RAID. It also only lowers the probability of this volume failing; we still have
to deal with the failure when it happens, since, as you point out, a datanode
cannot survive the failure of the boot disk.
bq. I too am not clear why the datanode process has to watch over "critical"
disks. It would be nice if the datanode considers all disks the same.
The idea is that the datanode can gracefully handle some types of volume
failures but not others. For example, the datanode should be able to survive the
failure of a disk that just hosts blocks, but cannot survive the failure of a
volume that resides on the boot disk.
Therefore, if a volume that resides on the boot disk fails, the datanode should
fail-stop and fail-fast (because it cannot tolerate this failure), but if a
volume that lives on one of the data disks fails, it should continue operating
(or decommission itself if the threshold of volume failures has been reached).
If the datanode considers all disks the same, it doesn't know whether it
should fail itself or tolerate the failure. Make sense?
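To make the distinction concrete, here is a minimal sketch of the policy
described above. It is not the datanode implementation: the class, enum and
method names are hypothetical, and it assumes that the admin supplies the set
of critical volumes and that the datanode decommissions itself once the number
of failed data volumes exceeds a tolerated count.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch of the proposed volume-failure policy.
 * None of these names come from the HDFS code base.
 */
class VolumeFailurePolicy {
    enum Action { CONTINUE, DECOMMISSION, SHUTDOWN }

    // Volumes the admin has marked as critical (assumed to be configured).
    private final Set<String> criticalVolumes;
    // Number of data volume failures tolerated before decommissioning
    // (assumed semantics: the count must be exceeded, not just reached).
    private final int toleratedFailures;
    // Data volumes that have already failed; a set so that a repeated
    // report for the same volume is not counted twice.
    private final Set<String> failedDataVolumes = new HashSet<>();

    VolumeFailurePolicy(Set<String> criticalVolumes, int toleratedFailures) {
        this.criticalVolumes = new HashSet<>(criticalVolumes);
        this.toleratedFailures = toleratedFailures;
    }

    /** Decide what the datanode should do when the given volume fails. */
    Action onVolumeFailure(String volume) {
        // A critical volume failure is never subject to the threshold:
        // the datanode cannot tolerate it, so it should fail-stop.
        if (criticalVolumes.contains(volume)) {
            return Action.SHUTDOWN;
        }
        // A data volume failure is tolerated until too many have failed.
        failedDataVolumes.add(volume);
        return failedDataVolumes.size() > toleratedFailures
                ? Action.DECOMMISSION
                : Action.CONTINUE;
    }
}
```

With the critical set containing only the boot volume and one tolerated
failure, the first failed data disk yields CONTINUE, a second distinct data
disk yields DECOMMISSION, and a failure of the boot volume yields SHUTDOWN
regardless of how many data disks have failed. Without the critical set the
policy has no way to tell the last case from the first two.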
> Datanodes should shutdown when a critical volume fails
> ------------------------------------------------------
>
> Key: HDFS-1848
> URL: https://issues.apache.org/jira/browse/HDFS-1848
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: data-node
> Reporter: Eli Collins
> Fix For: 0.23.0
>
>
> A DN should shutdown when a critical volume (eg the volume that hosts the OS,
> logs, pid, tmp dir etc.) fails. The admin should be able to specify which
> volumes are critical, eg they might specify the volume that lives on the boot
> disk. A failure in one of these volumes would not be subject to the threshold
> (HDFS-1161) or result in host decommissioning (HDFS-1847) as the
> decommissioning process would likely fail.