[
https://issues.apache.org/jira/browse/HDFS-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896219#comment-16896219
]
Chen Zhang commented on HDFS-13709:
-----------------------------------
Thanks [~jojochuang] for your detailed comments.
1. For your first comment:
{quote}I thought we already verify checksum during block transfer, but I was
wrong. Here's the code in {{DataNode#transferBlock}}
{quote}
I've checked that code in detail; the checksum verification is actually done by the {{BlockReceiver}} during block transfer:
{code:java}
// DataNode.java: the sixth argument (sendChecksum) is true
blockSender = new BlockSender(b, 0, b.getNumBytes(),
    false, false, true, DataNode.this, null, cachingStrategy);
{code}
The sixth parameter is true, which makes the blockSender send checksums to the
peer; {{BlockReceiver#verifyChunks()}} will then call {{reportRemoteBadBlock()}}
when it detects a checksum error.
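The pattern on the receiving side, as a standalone mock (the method names mirror {{verifyChunks()}} and {{reportRemoteBadBlock()}}, but this is an illustration, not the Hadoop source):
{code:java}
import java.io.IOException;
import java.util.zip.CRC32;

// Standalone mock of the verify-then-report pattern described above;
// not the actual BlockReceiver code.
public class VerifyChunksSketch {

  static void reportRemoteBadBlock(String srcDataNode, String block) {
    // In HDFS this is an RPC telling the NN the *sender's* replica is bad.
    System.out.println("report bad replica of " + block + " on " + srcDataNode);
  }

  static void verifyChunks(byte[] data, long checksumFromPeer,
      String srcDataNode, String block) throws IOException {
    CRC32 crc = new CRC32();
    crc.update(data);
    if (crc.getValue() != checksumFromPeer) {
      // The corrupt bytes came from the sending DataNode's disk.
      reportRemoteBadBlock(srcDataNode, block);
      throw new IOException("checksum mismatch while writing " + block);
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "chunk-bytes".getBytes();
    CRC32 good = new CRC32();
    good.update(data);
    verifyChunks(data, good.getValue(), "dn-a:9866", "blk_1");       // passes
    try {
      verifyChunks(data, good.getValue() ^ 1, "dn-a:9866", "blk_1"); // corrupt
    } catch (IOException e) {
      System.out.println("transfer aborted: " + e.getMessage());
    }
  }
}
{code}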
But in this case, checksum verification won't help: an EIO simply aborts the
transfer-block procedure, so nobody learns the replica is corrupted unless it
is later accessed by a client or the VolumeScanner.
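That is why reporting on the EIO itself matters. One wrinkle: Java does not expose errno, so recognizing an EIO-backed {{IOException}} typically means matching the kernel's message text. A minimal standalone sketch of that heuristic (the helper name is hypothetical, not from the patch):
{code:java}
import java.io.IOException;

// Hypothetical helper, not the HDFS-13709 patch itself. Java does not
// surface errno, so a common, if fragile, heuristic is to match the
// "Input/output error" text that an EIO produces.
public class EioDetector {
  static boolean looksLikeEio(IOException e) {
    String msg = e.getMessage();
    return msg != null && msg.contains("Input/output error");
  }

  public static void main(String[] args) {
    IOException disk = new IOException(
        "/data/1/current/blk_1073741825: Input/output error");
    IOException net = new IOException("Connection reset by peer");
    System.out.println(looksLikeEio(disk)); // true  -> report bad block to NN
    System.out.println(looksLikeEio(net));  // false -> ordinary network error
  }
}
{code}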
2. For your second suggestion:
{quote}It would be great if we can consolidate the error handling to support
both cases
{quote}
These two code paths behave a little differently:
* VolumeScanner reports a bad block for every {{IOException}} except
{{FileNotFoundException}}, because it only scans the disk, so every
{{IOException}} it sees comes from disk I/O.
* The DataTransfer thread should report a bad block only when the block access
hits an EIO, because {{IOException}}s are routine during data transfer over the
network and it is hard to identify the root cause (the sketch after this list
puts the two policies side by side).
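A simplified sketch of the two policies (hypothetical names, not the actual Hadoop code):
{code:java}
import java.io.FileNotFoundException;
import java.io.IOException;

// Side-by-side sketch of the two reporting policies (hypothetical names).
public class ReportPolicies {
  // VolumeScanner-style: anything but FileNotFoundException means a bad disk.
  static boolean scannerShouldReport(IOException e) {
    return !(e instanceof FileNotFoundException);
  }

  // DataTransfer-style: only a likely EIO; network IOExceptions are routine.
  static boolean transferShouldReport(IOException e) {
    String msg = e.getMessage();
    return msg != null && msg.contains("Input/output error");
  }

  public static void main(String[] args) {
    IOException reset = new IOException("Connection reset by peer");
    IOException eio = new IOException("Input/output error");
    System.out.println(scannerShouldReport(reset));  // true:  disk-scan context
    System.out.println(transferShouldReport(reset)); // false: network noise
    System.out.println(transferShouldReport(eio));   // true:  disk error
  }
}
{code}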
I'd be glad to consolidate the error handling to support both cases, but I
can't figure out a good way to do that. Do you have any ideas?
Thanks again.
> Report bad block to NN when transfer block encounter EIO exception
> ------------------------------------------------------------------
>
> Key: HDFS-13709
> URL: https://issues.apache.org/jira/browse/HDFS-13709
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: Chen Zhang
> Assignee: Chen Zhang
> Priority: Major
> Attachments: HDFS-13709.patch
>
>
> In our online cluster, the BlockPoolSliceScanner is turned off, and sometimes
> a bad disk track can cause data loss.
> For example, suppose there are 3 replicas on 3 machines A/B/C. If a bad track
> develops under A's replica data, and someday B and C crash at the same time,
> NN will try to replicate the data from A but fail. The block is now corrupt,
> but nobody knows, because NN thinks there is still at least 1 healthy replica
> and keeps trying to replicate it.
> When reading a replica that has data on a bad track, the OS returns an EIO
> error. If the DN reports the bad block as soon as it gets an EIO, we can
> detect this case ASAP and try to avoid data loss.