[
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13553193#comment-13553193
]
Suresh Srinivas commented on HDFS-3875:
---------------------------------------
Kihwal, here is how I understand the new behavior. Correct me if I am wrong. In
the following scenarios, the client is writing through a pipeline to datanodes
d1, d2 and d3. Each scenario lists, for the client and each datanode, whether
the data at that point is corrupt.
client(not corrupt) d1(not corrupt) d2(not corrupt) d3(corrupt)
* d3 detects corruption and sends a CHECKSUM_ERROR ack to d2
* d2 does not verify the checksum, so its own status is SUCCESS, but it
receives the CHECKSUM_ERROR and shuts down
* d1 does not verify the checksum. Its status is SUCCESS + MIRROR_ERROR.
Only d1 is considered a valid copy, even though d2 may not be corrupt.
client(not corrupt) d1(not corrupt) d2(corrupt) d3(corrupt)
* d3 detects corruption and sends a CHECKSUM_ERROR ack to d2
* d2 does not verify the checksum, so its own status is SUCCESS, but it
receives the CHECKSUM_ERROR and shuts down
* d1 does not verify the checksum. Its status is SUCCESS + MIRROR_ERROR.
Only d1 is considered a valid copy.
client(not corrupt) d1(corrupt) d2(corrupt) d3(corrupt)
* d3 detects corruption and sends a CHECKSUM_ERROR ack to d2
* d2 does not verify the checksum, so its own status is SUCCESS, but it
receives the CHECKSUM_ERROR and shuts down
* _d1 does not verify the checksum. Its status is SUCCESS + MIRROR_ERROR._
d1 is still considered a valid copy. Is this correct?
client(corrupt) d1(corrupt) d2(corrupt) d3(corrupt)
* d3 detects corruption and sends a CHECKSUM_ERROR ack to d2
* d2 does not verify the checksum, so its own status is SUCCESS, but it
receives the CHECKSUM_ERROR and shuts down
* d1 does not verify the checksum. Its status is SUCCESS + MIRROR_ERROR.
d1 is still considered a valid copy.
In all the above cases, whether a node detects the checksum error itself or a
downstream node detects it, the result appears the same to the upstream nodes
(as a mirror error). Is that what you intended?
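To make sure we are talking about the same thing, here is a toy model of the behavior as I understand it. The class and method names are illustrative only (this is not the actual BlockReceiver/PipelineAck code); it just encodes the rule from the scenarios above: only the last node verifies checksums, the node adjacent to the failure keeps status SUCCESS but shuts down, and everything further upstream sees only a mirror error.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the ack propagation described in the scenarios
// above; names follow the comment, not the real DataTransferProtocol types.
public class PipelineAckModel {
    static List<String> ackStatuses(boolean lastNodeDetectsCorruption, int pipelineLen) {
        List<String> statuses = new ArrayList<>();
        for (int i = 0; i < pipelineLen; i++) {
            statuses.add("SUCCESS"); // upstream nodes never verify checksums
        }
        if (lastNodeDetectsCorruption) {
            // Only the last node verifies and can report CHECKSUM_ERROR.
            statuses.set(pipelineLen - 1, "CHECKSUM_ERROR");
            // The node just upstream keeps status SUCCESS but shuts down;
            // every node further upstream sees only a mirror error.
            for (int i = 0; i < pipelineLen - 2; i++) {
                statuses.set(i, "SUCCESS+MIRROR_ERROR");
            }
        }
        return statuses;
    }

    public static void main(String[] args) {
        // Note the input says nothing about whether d1 or d2 actually hold
        // corrupt data: the ack looks the same in all four scenarios.
        System.out.println(ackStatuses(true, 3));
        // [SUCCESS+MIRROR_ERROR, SUCCESS, CHECKSUM_ERROR]
    }
}
```

The point of the model is that corruption at d1 or d2 is invisible in the ack; only d3's local verification result changes the outcome.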
> Issue handling checksum errors in write pipeline
> ------------------------------------------------
>
> Key: HDFS-3875
> URL: https://issues.apache.org/jira/browse/HDFS-3875
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs-client
> Affects Versions: 2.0.2-alpha
> Reporter: Todd Lipcon
> Assignee: Kihwal Lee
> Priority: Critical
> Attachments: hdfs-3875.branch-0.23.no.test.patch.txt,
> hdfs-3875.branch-0.23.with.test.patch.txt, hdfs-3875.trunk.no.test.patch.txt,
> hdfs-3875.trunk.no.test.patch.txt, hdfs-3875.trunk.patch.txt,
> hdfs-3875.trunk.patch.txt, hdfs-3875.trunk.with.test.patch.txt,
> hdfs-3875.trunk.with.test.patch.txt, hdfs-3875-wip.patch
>
>
> We saw this issue with one block in a large test cluster. The client is
> storing the data with replication level 2, and we saw the following:
> - the second node in the pipeline detects a checksum error on the data it
> received from the first node. We don't know if the client sent a bad
> checksum, or if it got corrupted between node 1 and node 2 in the pipeline.
> - this caused the second node to get kicked out of the pipeline, since it
> threw an exception. The pipeline started up again with only one replica (the
> first node in the pipeline).
> - this replica was later determined to be corrupt by the block scanner, and
> unrecoverable since it was the only replica.