Wei-Chiu Chuang created HDFS-10777:
--------------------------------------
Summary: DataNode should report & remove volume failures if DU
cannot access files
Key: HDFS-10777
URL: https://issues.apache.org/jira/browse/HDFS-10777
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Affects Versions: 2.8.0
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang
HADOOP-12973 refactored DU and made it pluggable. The refactoring has a
side effect: if DU encounters an exception, the exception is caught, logged,
and ignored, which essentially fixes HDFS-9908 (where runaway exceptions
prevented DataNodes from handshaking with NameNodes).
However, this "fix" is not ideal: if the disk is bad, the DataNode takes no
immediate action other than logging the exception. The existing
{{FsDatasetSpi#checkDataDir}} has been reduced to blindly checking a small
number of directories. When a disk goes bad, often only a few files are bad
initially, so checking only a small number of directories makes it easy to
overlook the degraded disk.
I propose: in addition to logging the exception, the DataNode should
proactively verify whether the files are accessible, remove the volume if
they are not, and make the failure visible in JMX, so that administrators
can spot the failure via monitoring systems.
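A minimal sketch of the proposed behavior: when the DU scan throws, probe a
sample of files on the volume rather than only logging, and report the volume
as failed if any probe fails. The class and method names below are
illustrative only, not actual Hadoop APIs.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch, not Hadoop code: on a DU exception, probe files on
// the volume and collect the inaccessible ones so the caller can remove the
// volume and expose the failure (e.g. via JMX).
public class VolumeFailureProbe {

    /** Probes up to maxProbes regular files under volumeRoot; returns those not readable. */
    public static List<Path> findInaccessibleFiles(Path volumeRoot, int maxProbes)
            throws IOException {
        List<Path> bad = new ArrayList<>();
        try (Stream<Path> stream = Files.walk(volumeRoot)) {
            List<Path> sample = stream.filter(Files::isRegularFile)
                                      .limit(maxProbes)
                                      .collect(Collectors.toList());
            for (Path p : sample) {
                // A healthy disk should pass every readability probe.
                if (!Files.isReadable(p)) {
                    bad.add(p);
                }
            }
        }
        return bad;
    }

    public static void main(String[] args) throws IOException {
        // Demo on a throwaway directory standing in for a DataNode volume.
        Path volume = Files.createTempDirectory("volume");
        Files.writeString(volume.resolve("blk_0001"), "data");
        List<Path> bad = findInaccessibleFiles(volume, 100);
        System.out.println(bad.isEmpty() ? "volume healthy" : "volume failed: " + bad);
    }
}
```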
A different fix, based on HDFS-9908, is needed before Hadoop 2.8.0.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]