[ https://issues.apache.org/jira/browse/HDFS-10777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei-Chiu Chuang updated HDFS-10777:
-----------------------------------
Attachment: HDFS-10777.01.patch
v01: a proof of concept. Whenever DU gets an IOException whose message matches
"cannot access (.*): Input/output error", it sets a flag. If the flag is set, the
DiskCheck thread scans all directories under the volume.
Because the entire volume is scanned only in this specific case, rather than
blindly, I feel the performance impact should be acceptable.
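For readers without the patch at hand, the flag-then-scan idea can be sketched roughly as below. This is only an illustration and not the code in HDFS-10777.01.patch: the class name DuFailureFlag and the methods onDuException, consumeScanRequest and countUnreadableDirs are hypothetical, and the sketch uses plain java.io rather than the DataNode's own volume and disk-check classes. It assumes a directory that cannot be listed is a sign of a failing disk.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.regex.Pattern;

/** Hypothetical sketch of the flag-then-scan idea; all names are illustrative. */
final class DuFailureFlag {
  // The message pattern quoted in the comment above.
  private static final Pattern IO_ERROR =
      Pattern.compile("cannot access (.*): Input/output error");

  // Set when DU reports an I/O error; cleared when a scan is started.
  private final AtomicBoolean fullScanRequested = new AtomicBoolean(false);

  /** Meant to be called from wherever the DU refresh catches its IOException. */
  void onDuException(IOException e) {
    String msg = e.getMessage();
    if (msg != null && IO_ERROR.matcher(msg).find()) {
      fullScanRequested.set(true);
    }
  }

  /** Returns true at most once per request, so only one full scan is triggered. */
  boolean consumeScanRequest() {
    return fullScanRequested.getAndSet(false);
  }

  /** Walks every directory under root and counts those that cannot be listed. */
  static int countUnreadableDirs(File root) {
    int failures = 0;
    Deque<File> pending = new ArrayDeque<>();
    pending.push(root);
    while (!pending.isEmpty()) {
      File dir = pending.pop();
      // listFiles() returns null on an I/O error or if dir is not a directory.
      File[] children = dir.listFiles();
      if (children == null) {
        failures++;
        continue;
      }
      for (File child : children) {
        // Do not follow symbolic links, to avoid loops and leaving the volume.
        if (Files.isDirectory(child.toPath(), LinkOption.NOFOLLOW_LINKS)) {
          pending.push(child);
        }
      }
    }
    return failures;
  }
}
```

In this sketch the disk-check thread would call consumeScanRequest() on each pass and run countUnreadableDirs() on the volume root only when it returns true, so the cost of the full scan is paid only after DU has actually seen an I/O error.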
> DataNode should report&remove volume failures if DU cannot access files
> -----------------------------------------------------------------------
>
> Key: HDFS-10777
> URL: https://issues.apache.org/jira/browse/HDFS-10777
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Affects Versions: 2.8.0
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
> Attachments: HDFS-10777.01.patch
>
>
> HADOOP-12973 refactored DU and made it pluggable. The refactoring has a
> side effect: if DU encounters an exception, the exception is caught,
> logged and ignored, which essentially fixes HDFS-9908 (in which runaway
> exceptions prevent DataNodes from handshaking with NameNodes).
> However, this "fix" is not good, in the sense that if the disk is bad, the
> DataNode takes no immediate action other than logging the exception.
> The existing {{FsDatasetSpi#checkDataDir}} has been reduced to blindly checking
> only a small number of directories. When a disk goes bad, often only a few
> files are bad initially, so checking only a small number of directories can
> easily overlook the degraded disk.
> I propose: in addition to logging the exception, DataNode should proactively
> verify the files are not accessible, remove the volume, and make the failure
> visible by showing it in JMX, so that administrators can spot the failure via
> monitoring systems.
> A different fix, based on HDFS-9908, is needed before Hadoop 2.8.0.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)