Wei-Chiu Chuang resolved HDFS-10777.
    Resolution: Invalid

Closing this jira as invalid. I'll file an improvement jira to add logging or 
a metric when DataNode disks become flaky.

> DataNode should report&remove volume failures if DU cannot access files
> -----------------------------------------------------------------------
>                 Key: HDFS-10777
>                 URL: https://issues.apache.org/jira/browse/HDFS-10777
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.8.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>         Attachments: HDFS-10777.01.patch
> 
> HADOOP-12973 refactored DU and made it pluggable. The refactoring has the 
> side effect that if DU encounters an exception, the exception is caught, 
> logged, and ignored, which essentially fixes HDFS-9908 (where runaway 
> exceptions prevented DataNodes from handshaking with NameNodes).
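> 
> For illustration, the resulting pattern looks roughly like this (a 
> hypothetical sketch with illustrative names, not the actual HADOOP-12973 
> code):
> 
>   import java.io.File;
>   import java.io.IOException;
>   import java.util.logging.Logger;
> 
>   /** Periodic disk-usage refresh that swallows I/O errors, so a failing
>    *  disk surfaces only as a log message. */
>   class RefreshingUsage {
>     private static final Logger LOG = Logger.getLogger("RefreshingUsage");
>     private volatile long usedBytes;
> 
>     void refresh(File volumeDir) {
>       try {
>         usedBytes = walkAndSum(volumeDir);  // may fail on a bad disk
>       } catch (IOException e) {
>         // Caught, logged, and ignored: no volume check, no removal,
>         // so a degraded volume stays in service.
>         LOG.warning("Failed to refresh disk usage for " + volumeDir + ": " + e);
>       }
>     }
> 
>     private long walkAndSum(File dir) throws IOException {
>       File[] files = dir.listFiles();
>       if (files == null) {
>         throw new IOException("Cannot list " + dir);  // I/O error or bad dir
>       }
>       long sum = 0;
>       for (File f : files) {
>         sum += f.isDirectory() ? walkAndSum(f) : f.length();
>       }
>       return sum;
>     }
>   }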
> However, this "fix" is incomplete: if the disk is bad, the DataNode takes no 
> immediate action other than logging the exception. The existing 
> {{FsDatasetSpi#checkDataDir}} has been reduced to blindly checking only a 
> small number of directories. When a disk starts to fail, often only a few 
> files are bad at first, so a check that samples just a few directories can 
> easily overlook the degraded disk, as the sketch below illustrates.
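> 
> To see how easy it is to miss, consider a blind spot-check (hypothetical 
> sketch; the directory counts in the comment are only an example):
> 
>   import java.io.File;
>   import java.util.Random;
> 
>   /** Spot-checks a few randomly chosen directories on a volume. */
>   class BlindDirCheck {
>     private final Random rand = new Random();
> 
>     boolean volumeLooksHealthy(File[] blockDirs, int samples) {
>       for (int i = 0; i < samples; i++) {
>         File dir = blockDirs[rand.nextInt(blockDirs.length)];
>         if (!dir.exists() || !dir.canRead()) {
>           return false;  // caught only if we happened to pick a bad directory
>         }
>       }
>       // With 256 subdirectories of which 2 are bad, 5 random probes detect
>       // the failure with probability 1 - (254/256)^5, i.e. about 4%.
>       return true;
>     }
>   }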
> I propose that, in addition to logging the exception, the DataNode should 
> proactively verify that the affected files are inaccessible, remove the 
> volume, and make the failure visible in JMX so that administrators can spot 
> it via their monitoring systems.
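> 
> A rough sketch of the proposed handling (the {{VolumeFailureHandler}} 
> interface below is illustrative, not an existing Hadoop API):
> 
>   import java.io.File;
>   import java.io.IOException;
> 
>   /** Callbacks the DataNode would expose; names are illustrative. */
>   interface VolumeFailureHandler {
>     void removeVolume(File volumeDir);   // take the volume out of service
>     void incrementVolumeFailures();      // counter surfaced through JMX
>   }
> 
>   class ProactiveUsageCheck {
>     private final VolumeFailureHandler handler;
> 
>     ProactiveUsageCheck(VolumeFailureHandler handler) {
>       this.handler = handler;
>     }
> 
>     /** Called when the disk-usage scan hits an I/O error on a file. */
>     void onUsageScanError(File volumeDir, File suspectFile, IOException cause) {
>       // Re-verify instead of trusting a possibly transient error.
>       if (!suspectFile.exists() || !suspectFile.canRead()) {
>         handler.removeVolume(volumeDir);
>         handler.incrementVolumeFailures();  // admins spot this via monitoring
>       }
>     }
>   }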
> A different fix, based on HDFS-9908, is needed before Hadoop 2.8.0.
