[ https://issues.apache.org/jira/browse/HDFS-10627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15384523#comment-15384523 ]
Daryn Sharp commented on HDFS-10627:
------------------------------------
Adding a feedback mechanism would be very useful, but it should be a separate
jira. I'm sure it's harder than it seems. (I'm not sure why the packet
responder isn't started. It may sometimes be an optimization, and/or I
suspect it has to do with recovery needing to copy what appears to be
tail corruption before truncating it. Not my area of expertise...)
This jira, however, must restore the prior behavior so our clusters can actually
detect bad blocks. Latent corruption is going undetected. Legit/detected
corruption is queued for days, which increases the risk of data loss. DNs are too
busy verifying false positives from clients that didn't fully read the stream.
> Volume Scanner mark a block as "suspect" even if the block sender encounters
> 'Broken pipe' or 'Connection reset by peer' exception
> ----------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-10627
> URL: https://issues.apache.org/jira/browse/HDFS-10627
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 2.7.0
> Reporter: Rushabh S Shah
> Assignee: Rushabh S Shah
> Attachments: HDFS-10627.patch
>
>
> In the BlockSender code,
> {code:title=BlockSender.java|borderStyle=solid}
>   if (!ioem.startsWith("Broken pipe")
>       && !ioem.startsWith("Connection reset")) {
>     LOG.error("BlockSender.sendChunks() exception: ", e);
>   }
>   datanode.getBlockScanner().markSuspectBlock(
>       volumeRef.getVolume().getStorageID(),
>       block);
> {code}
> Before HDFS-7686, the block was marked as suspect only if the exception
> message did not start with "Broken pipe" or "Connection reset".
> After HDFS-7686, the block is marked as suspect irrespective of the
> exception message.
> On one of our datanodes, it took approximately a whole day (22 hours) to work
> through all the suspect blocks before it scanned one corrupt block.
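
For illustration only, the pre-HDFS-7686 behavior described above could be
sketched roughly as follows. This is not the attached patch and not the actual
BlockSender code: the class SuspectBlockFilter and its methods are hypothetical
names, and treating a null exception message as "not a client disconnect" is an
assumption made here to avoid a NullPointerException.

```java
import java.io.IOException;

// Hypothetical helper, not part of Hadoop: decides whether a send failure
// should cause the block to be reported to the scanner as suspect.
final class SuspectBlockFilter {

    private SuspectBlockFilter() {}

    // True when the message looks like the client went away mid-read,
    // which says nothing about the health of the on-disk replica.
    static boolean isClientDisconnect(IOException e) {
        String msg = e.getMessage();
        // Assumption: a missing message is not treated as a disconnect.
        return msg != null
            && (msg.startsWith("Broken pipe")
                || msg.startsWith("Connection reset"));
    }

    // Prior behavior as described in the issue: only failures that are
    // not client disconnects make the block suspect.
    static boolean shouldMarkSuspect(IOException e) {
        return !isClientDisconnect(e);
    }

    public static void main(String[] args) {
        IOException[] samples = {
            new IOException("Broken pipe"),
            new IOException("Connection reset by peer"),
            new IOException("Input/output error")
        };
        for (IOException e : samples) {
            System.out.println(e.getMessage() + " -> markSuspect="
                + shouldMarkSuspect(e));
        }
    }
}
```

In the real code the call to markSuspectBlock would sit behind a check like
shouldMarkSuspect, so that clients that stop reading early do not flood the
suspect queue.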