[ https://issues.apache.org/jira/browse/HDFS-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022816#comment-13022816 ]

Eli Collins commented on HDFS-1848:
-----------------------------------

bq. I am wondering if this is necessary? Typically, critical volume (eg the 
volume that hosts the OS, logs, pid, tmp dir etc.) is RAID-1 and if this goes 
down we can safely assume Datanode to be down.

I don't think we should require that datanodes use RAID-1. Raiding the boot 
disk (OS, logs, pids, etc.) on every datanode wastes an extra disk per datanode 
in the cluster and requires that datanodes have a HW RAID controller or use SW 
RAID. Moreover, RAID only lowers the probability of that volume failing; we 
still have to handle the failure, and as you point out a datanode cannot 
survive the failure of the boot disk.

bq. I too am not clear why the datanode process has to watch over "critical" 
disks. It would be nice if the datanode considers all disks the same.

The idea is that the datanode can gracefully handle some types of volume 
failures but not others. For example, the datanode should be able to survive 
the failure of a disk that just hosts blocks, but it cannot survive the failure 
of a volume that resides on the boot disk.

Therefore, if the volume that resides on the boot disk fails, the datanode 
should fail-stop and fail-fast (because it cannot tolerate this failure), but 
if a volume that lives on one of the data disks fails, it should continue 
operating (or decommission itself if the threshold of volume failures has been 
reached). If the datanode considers all disks the same, then it doesn't know 
whether it should fail itself or tolerate the failure. Make sense?
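
In pseudocode, the decision I'm describing looks roughly like this (the class 
and method names below are made up for illustration, not actual DataNode 
internals):

{code}
import java.io.File;
import java.util.Set;

// Rough sketch only: names are hypothetical, not actual DataNode code.
class VolumeFailureHandler {
  private final Set<File> criticalVolumes;   // admin-specified, e.g. the volume on the boot disk
  private final int failedVolumesTolerated;  // data-volume threshold (HDFS-1161)
  private int failedDataVolumes = 0;

  VolumeFailureHandler(Set<File> criticalVolumes, int failedVolumesTolerated) {
    this.criticalVolumes = criticalVolumes;
    this.failedVolumesTolerated = failedVolumesTolerated;
  }

  /** Called when a disk check on the given volume fails. */
  void onVolumeFailure(File volume) {
    if (criticalVolumes.contains(volume)) {
      // A failure the datanode cannot tolerate: fail-stop / fail-fast.
      throw new RuntimeException("Critical volume failed: " + volume);
    }
    // A data-only volume: keep serving blocks unless too many have failed.
    failedDataVolumes++;
    if (failedDataVolumes > failedVolumesTolerated) {
      decommissionSelf();  // see HDFS-1847
    }
  }

  private void decommissionSelf() {
    // Placeholder: trigger decommissioning / shutdown of this datanode.
  }
}
{code}

The point is just that the handler needs to know which set a failed volume 
belongs to; treating all disks the same loses that distinction.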

> Datanodes should shutdown when a critical volume fails
> ------------------------------------------------------
>
>                 Key: HDFS-1848
>                 URL: https://issues.apache.org/jira/browse/HDFS-1848
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: data-node
>            Reporter: Eli Collins
>             Fix For: 0.23.0
>
>
> A DN should shut down when a critical volume (e.g. the volume that hosts the 
> OS, logs, pid, tmp dir, etc.) fails. The admin should be able to specify which 
> volumes are critical, e.g. they might specify the volume that lives on the boot 
> disk. A failure in one of these volumes would not be subject to the threshold 
> (HDFS-1161) or result in host decommissioning (HDFS-1847), as the 
> decommissioning process would likely fail.
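
(Illustrative sketch only: one way the admin-specified critical volume list 
described above might be loaded; the property name 
"dfs.datanode.critical.volumes" is hypothetical, not an existing HDFS 
configuration key.)

{code}
import java.io.File;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;

// Hypothetical sketch: the config key below does not exist today; it only
// illustrates how an admin might list critical volumes.
class CriticalVolumeConfig {
  static Set<File> load(Configuration conf) {
    Set<File> critical = new HashSet<File>();
    // e.g. dfs.datanode.critical.volumes = /,/var/log/hadoop
    for (String dir : conf.get("dfs.datanode.critical.volumes", "").split(",")) {
      if (!dir.trim().isEmpty()) {
        critical.add(new File(dir.trim()));
      }
    }
    return critical;
  }
}
{code}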

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
