[jira] [Updated] (HDFS-10777) DataNode should report&remove volume failures if DU cannot access files

Wei-Chiu Chuang (JIRA) Wed, 24 Aug 2016 11:19:33 -0700

     [ 
https://issues.apache.org/jira/browse/HDFS-10777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wei-Chiu Chuang updated HDFS-10777:
-----------------------------------
    Attachment: HDFS-10777.01.patch

v01: a proof of concept. Whenever DU gets an IOException with message of 
"cannot access (.*): Input/output error", set up a flag. If the flag is on, 
DiskCheck thread scans all directories under the volume.

Given that we are scanning the entire volume only in this specific case, rather 
than scanning blindly, I feel the performance impact should be acceptable.

> DataNode should report&remove volume failures if DU cannot access files
> -----------------------------------------------------------------------
>
>                 Key: HDFS-10777
>                 URL: https://issues.apache.org/jira/browse/HDFS-10777
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>         Attachments: HDFS-10777.01.patch
>
>
> HADOOP-12973 refactored DU and makes it pluggable. The refactory has a 
> side-effect that if DU encounters an exception, the exception is caught, 
> logged and ignored, essentially fixes HDFS-9908 (in which case runaway 
> exceptions prevent DataNodes from handshaking with NameNodes).
> However, this "fix" is not good, in the sense that if the disk is bad, there 
> is no immediate action made by the DataNode other than logging the exception. 
> Existing {{FsDatasetSpi#checkDataDir}} has been reduced to only check a few 
> number of directories blindly. If a disk goes bad, it is often possible that 
> only a few files are bad initially and that by checking only a small number 
> of directories it is easy to overlook the degraded disk.
> I propose: in addition to logging the exception, DataNode should proactively 
> verify the files are not accessible, remove the volume, and make the failure 
> visible by showing it in JMX, so that administrators can spot the failure via 
> monitoring systems.
> A different fix, based on HDFS-9908, is needed before Hadoop 2.8.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (HDFS-10777) DataNode should report&remove volume failures if DU cannot access files

Reply via email to