[jira] [Commented] (HDFS-13709) Report bad block to NN when transfer block encounter EIO exception

Stephen O'Donnell (JIRA) Tue, 13 Aug 2019 08:48:19 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16906329#comment-16906329
 ]


Stephen O'Donnell commented on HDFS-13709:
------------------------------------------

[~jojochuang] pointed me to this Jira as I am working on the related one in 
HDFS-14706. I just have one minor comment here. In your definition of the new 
exception class "DiskFileCorruptException", if you add a constructor like:

 
{code:java}
public DiskFileCorruptException(String msg, Throwable cause) {
    super(msg, cause);
}{code}
Then you can avoid having to adjust the stack trace etc when you create this 
exception, so you can change this:
{code:java}
+        if (ioe.getMessage().startsWith(EIO_ERROR)) {
+          DiskFileCorruptException de =
+              new DiskFileCorruptException("Original Exception : " + ioe);
+          de.initCause(ioe);
+          de.setStackTrace(ioe.getStackTrace());
+          throw de;
+        }{code}
To just this:
{code:java}
if (ioe.getMessage().startsWith(EIO_ERROR)) {
  throw new DiskFileCorruptException("A disk IO error occurred", ioe);
}{code}
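
For illustration, here is a minimal self-contained sketch of the suggestion. It assumes DiskFileCorruptException extends IOException (the superclass used in the patch is not shown here), and the EioWrapDemo class, its rethrow helper, and the "Input/output error" value of EIO_ERROR are hypothetical stand-ins. The null check on getMessage() is an extra precaution in this sketch, not something taken from the patch.

```java
import java.io.IOException;

// Assumption: the exception extends IOException; the real patch may differ.
class DiskFileCorruptException extends IOException {
  DiskFileCorruptException(String msg) {
    super(msg);
  }

  // Two-argument constructor: the superclass records the cause, so callers
  // need neither initCause() nor setStackTrace().
  DiskFileCorruptException(String msg, Throwable cause) {
    super(msg, cause);
  }
}

// Hypothetical demo class, not part of the patch.
class EioWrapDemo {
  // Hypothetical prefix; the real constant is defined in the patch.
  static final String EIO_ERROR = "Input/output error";

  // Wraps EIO-style failures in DiskFileCorruptException and rethrows
  // every other IOException unchanged.
  static void rethrow(IOException ioe) throws IOException {
    String msg = ioe.getMessage();
    if (msg != null && msg.startsWith(EIO_ERROR)) {
      throw new DiskFileCorruptException("A disk IO error occurred", ioe);
    }
    throw ioe;
  }

  public static void main(String[] args) {
    try {
      rethrow(new IOException(EIO_ERROR + " while reading block file"));
    } catch (IOException e) {
      // The original exception is still reachable through getCause().
      System.out.println(e.getClass().getSimpleName()
          + " caused by " + e.getCause());
    }
  }
}
```

Because the cause is attached at construction time, a printed stack trace shows the original IOException under a "Caused by:" section, so nothing is lost by dropping the setStackTrace() call.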

> Report bad block to NN when transfer block encounter EIO exception
> ------------------------------------------------------------------
>
>                 Key: HDFS-13709
>                 URL: https://issues.apache.org/jira/browse/HDFS-13709
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Chen Zhang
>            Assignee: Chen Zhang
>            Priority: Major
>         Attachments: HDFS-13709.002.patch, HDFS-13709.patch
>
>
> In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes 
> a bad disk track may cause data loss.
> For example, suppose there are 3 replicas on 3 machines A/B/C. If a bad track 
> occurs on A's replica data, and someday B and C crash at the same time, the NN 
> will try to replicate the data from A but fail. The block is now corrupt, but 
> no one knows, because the NN thinks there is at least 1 healthy replica and 
> keeps trying to replicate it.
> When reading a replica that has data on a bad track, the OS returns an EIO 
> error. If the DN reports the bad block as soon as it gets an EIO, we can find 
> this case ASAP and try to avoid data loss.



