[
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13506005#comment-13506005
]
Kihwal Lee commented on HDFS-3875:
----------------------------------
> Is the writer able to close the file successfully?
Yes. In one case the corrupt block ended up in the middle of a file. Of course
all replicas were corrupt, so when the NN tried to raise the replication
factor, all replicas got marked corrupt.
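To make that concrete, here is a toy sketch of the NameNode-side bookkeeping
for this situation. The names (CorruptTracker, markCorrupt, isUnrecoverable)
are hypothetical, not the actual CorruptReplicasMap/BlockManager API:
{code:java}
import java.util.HashSet;
import java.util.Set;

// Illustrative only: tracks which replicas of one block failed verification.
public class CorruptTracker {
    private final Set<String> allReplicas = new HashSet<>();     // DNs holding the block
    private final Set<String> corruptReplicas = new HashSet<>(); // DNs whose copy failed checksum

    void addReplica(String datanode) {
        allReplicas.add(datanode);
    }

    // Called when verification (e.g. during re-replication) finds a mismatch.
    void markCorrupt(String datanode) {
        if (allReplicas.contains(datanode)) {
            corruptReplicas.add(datanode);
        }
    }

    // Once every replica is corrupt, there is no good source left to copy from.
    boolean isUnrecoverable() {
        return !allReplicas.isEmpty()
            && corruptReplicas.containsAll(allReplicas);
    }
}
{code}
Once isUnrecoverable() would return true, no healthy source replica is left
for re-replication, which is the state described above.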
I will post my patch for review, not for a precommit build, although I ran
all the tests; only testBlockCorruptionRecoveryPolicy2 failed, probably due
to the change in how corruption recovery works. I haven't debugged it yet.
Please take a look and see whether the approach seems reasonable.
> Issue handling checksum errors in write pipeline
> ------------------------------------------------
>
> Key: HDFS-3875
> URL: https://issues.apache.org/jira/browse/HDFS-3875
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs-client
> Affects Versions: 2.0.2-alpha
> Reporter: Todd Lipcon
> Assignee: Kihwal Lee
> Priority: Blocker
>
> We saw this issue with one block in a large test cluster. The client is
> storing the data with replication level 2, and we saw the following:
> - the second node in the pipeline detects a checksum error on the data it
> received from the first node (a minimal sketch of this check follows the
> description below). We don't know if the client sent a bad checksum, or if
> it got corrupted between node 1 and node 2 in the pipeline.
> - this caused the second node to get kicked out of the pipeline, since it
> threw an exception. The pipeline started up again with only one replica (the
> first node in the pipeline)
> - this replica was later determined to be corrupt by the block scanner, and
> unrecoverable since it was the only replica
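As referenced above, a minimal sketch of the detection step at the second
node, assuming CRC32 checksums. verifyPacket() and its signature are
hypothetical, not the actual BlockReceiver code:
{code:java}
import java.io.IOException;
import java.util.zip.CRC32;

// Illustrative only: a downstream pipeline node recomputes the CRC of the
// packet payload and compares it with the checksum sent by the upstream node.
public class PacketChecksumCheck {
    static void verifyPacket(byte[] payload, long upstreamCrc) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        if (crc.getValue() != upstreamCrc) {
            // Throwing here is what got the second node kicked out of the
            // pipeline, leaving a single (corrupt) replica upstream.
            throw new IOException("checksum mismatch on received packet");
        }
    }
}
{code}
Throwing at this point ejects the detecting node from the pipeline, which is
how the scenario ends up with one corrupt, unrecoverable replica.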