[ https://issues.apache.org/jira/browse/HDFS-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022931#comment-13022931 ]
Eli Collins commented on HDFS-1848:
-----------------------------------
Good points, Bharath.
I think the DN should explicitly check its volumes for health as it does today
and either fail-fast or tolerate failures appropriately based on the volume
that failed. This may require help from an admin in the form of specifying
critical volumes, or maybe we could detect these automatically.
In general, the DN and TT need to fail fast when they face unrecoverable
failures, e.g. if you turn off volume checking and make the root disk read-only,
the DN and TT should not try to soldier on. I.e. some exception handling
situations should result in termination of service and, if possible, a shutdown.
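
To make the policy above concrete, here is a rough sketch of the decision I have
in mind. It is illustrative Python, not DataNode code: the names (decide,
volume_is_healthy, critical_volumes, tolerated) are made up for this example,
and the health probe (create and delete a temp file) is only an assumption about
what a volume check might do.

```python
import os
import tempfile
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    SHUTDOWN = "shutdown"


def volume_is_healthy(path):
    """Hypothetical probe: the volume is healthy if we can create and
    delete a file in it. A read-only or missing volume fails this check."""
    if not os.path.isdir(path):
        return False
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False


def decide(data_volumes, critical_volumes, tolerated, is_healthy=volume_is_healthy):
    """Return (action, failed_volumes).

    - Any failed critical volume means shutdown, regardless of `tolerated`.
    - Otherwise, shut down only if more than `tolerated` data volumes failed.
    """
    failed_critical = [v for v in critical_volumes if not is_healthy(v)]
    if failed_critical:
        # Fail fast: a critical volume is not subject to the threshold.
        return Action.SHUTDOWN, failed_critical

    failed_data = [v for v in data_volumes if not is_healthy(v)]
    if len(failed_data) > tolerated:
        return Action.SHUTDOWN, failed_data
    return Action.CONTINUE, failed_data
```

For example, decide(["/data/1", "/data/2"], ["/"], tolerated=1) would keep
running with one bad data disk, but would ask for a shutdown as soon as "/"
stops accepting writes. The paths here are placeholders.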
> Datanodes should shutdown when a critical volume fails
> ------------------------------------------------------
>
> Key: HDFS-1848
> URL: https://issues.apache.org/jira/browse/HDFS-1848
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: data-node
> Reporter: Eli Collins
> Fix For: 0.23.0
>
>
> A DN should shut down when a critical volume (e.g. the volume that hosts the OS,
> logs, pid, tmp dir etc.) fails. The admin should be able to specify which
> volumes are critical, e.g. they might specify the volume that lives on the boot
> disk. A failure in one of these volumes would not be subject to the threshold
> (HDFS-1161) or result in host decommissioning (HDFS-1847) as the
> decommissioning process would likely fail.