[
https://issues.apache.org/jira/browse/HDFS-3875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13508134#comment-13508134
]
Tsz Wo (Nicholas), SZE commented on HDFS-3875:
----------------------------------------------
Hi Kihwal,
In a client write pipeline, only the last datanode verifies the checksum. If
there is a checksum error, we don't know what went wrong: one of the datanodes
could be faulty, or a network path could be faulty. So the client must stop; it
cannot simply take out a datanode and continue. Do you agree?
In the patch, only the last datanode can report a checksum error. If it does,
all statuses in the ack become ERROR_CHECKSUM. The approach seems reasonable.
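To make the ack behaviour concrete, here is a minimal sketch. It is not the patch code: PipelineAckSketch, buildAck and mustAbort are hypothetical names, and the model assumes that only the tail datanode can flag a checksum error.

```java
import java.util.Arrays;

// Hypothetical, simplified model of a pipeline ack; these are NOT the real HDFS classes.
class PipelineAckSketch {
    enum Status { SUCCESS, ERROR_CHECKSUM }

    // Builds the per-datanode statuses the client would see in one ack.
    // Assumption: if the tail datanode reported a checksum error, every slot is
    // marked ERROR_CHECKSUM, because the corruption cannot be attributed to a
    // single datanode or network path.
    static Status[] buildAck(int pipelineLength, boolean tailChecksumError) {
        Status[] replies = new Status[pipelineLength];
        Arrays.fill(replies, tailChecksumError ? Status.ERROR_CHECKSUM : Status.SUCCESS);
        return replies;
    }

    // True if the client must stop writing instead of dropping one datanode.
    static boolean mustAbort(Status[] replies) {
        for (Status s : replies) {
            if (s == Status.ERROR_CHECKSUM) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Status[] ack = buildAck(3, true);
        System.out.println(Arrays.toString(ack) + " abort=" + mustAbort(ack));
    }
}
```

With a three-node pipeline and a checksum error at the tail, every status in the ack is ERROR_CHECKSUM, so the client has no single node to exclude and has to stop.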
Some questions on the patch:
- receivePacket() returns -1 on a checksum error. Why not throw an exception?
Returning -1 should mean a normal exit.
- The caught exception is not used below. Should it be re-thrown?
{code}
+ if (shouldVerifyChecksum()) {
+   try {
+     verifyChunks(dataBuf, checksumBuf);
+   } catch (IOException e) {
+     // checksum error detected locally. there is no reason to continue.
+     if (responder != null) {
+       ((PacketResponder) responder.getRunnable()).enqueue(seqno,
+           lastPacketInBlock, offsetInBlock,
+           Status.ERROR_CHECKSUM);
+     }
+     // return without writing data.
+     checksumError = true;
+     return -1;
+   }
{code}
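For illustration, the alternative raised in the two questions above could look roughly like the following sketch: the ERROR_CHECKSUM ack is still enqueued, but the caught exception is re-thrown rather than being swallowed behind a -1 return. Everything here is an assumption made for the example: ReceivePacketSketch, its enqueuedAcks list and the CRC32-based verifyChunks are stand-ins, not the BlockReceiver code in the patch.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical stand-in for the receiver; NOT the real BlockReceiver.
class ReceivePacketSketch {
    enum Status { SUCCESS, ERROR_CHECKSUM }

    // Hypothetical replacement for the responder's ack queue.
    final List<Status> enqueuedAcks = new ArrayList<>();

    // Assumed verifier: compares a CRC32 of the data with the expected value.
    static void verifyChunks(byte[] data, long expectedCrc) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        if (crc.getValue() != expectedCrc) {
            throw new IOException("checksum mismatch");
        }
    }

    // Returns the number of bytes accepted. A checksum failure is reported
    // upstream via the ack and then propagated to the caller as an exception.
    int receivePacket(byte[] data, long expectedCrc) throws IOException {
        try {
            verifyChunks(data, expectedCrc);
        } catch (IOException e) {
            enqueuedAcks.add(Status.ERROR_CHECKSUM);
            throw e; // re-throw instead of returning -1
        }
        enqueuedAcks.add(Status.SUCCESS);
        return data.length;
    }

    public static void main(String[] args) {
        byte[] data = {1, 2, 3};
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        ReceivePacketSketch receiver = new ReceivePacketSketch();
        try {
            receiver.receivePacket(data, crc.getValue());     // accepted
            receiver.receivePacket(data, crc.getValue() + 1); // wrong checksum, throws
        } catch (IOException e) {
            System.out.println("caught: " + e.getMessage());
        }
        System.out.println(receiver.enqueuedAcks);
    }
}
```

With this shape the caller can tell a checksum failure from a normal exit by the exception itself, and the original IOException is not lost.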
> Issue handling checksum errors in write pipeline
> ------------------------------------------------
>
> Key: HDFS-3875
> URL: https://issues.apache.org/jira/browse/HDFS-3875
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, hdfs-client
> Affects Versions: 2.0.2-alpha
> Reporter: Todd Lipcon
> Assignee: Kihwal Lee
> Priority: Blocker
> Attachments: hdfs-3875.branch-0.23.no.test.patch.txt,
> hdfs-3875.branch-0.23.with.test.patch.txt, hdfs-3875.trunk.no.test.patch.txt,
> hdfs-3875.trunk.no.test.patch.txt, hdfs-3875.trunk.with.test.patch.txt,
> hdfs-3875.trunk.with.test.patch.txt, hdfs-3875-wip.patch
>
>
> We saw this issue with one block in a large test cluster. The client was
> storing the data with replication level 2, and we saw the following:
> - The second node in the pipeline detected a checksum error on the data it
> received from the first node. We don't know if the client sent a bad
> checksum, or if the data got corrupted between node 1 and node 2 in the
> pipeline.
> - This caused the second node to get kicked out of the pipeline, since it
> threw an exception. The pipeline started up again with only one replica (the
> first node in the pipeline).
> - This replica was later determined to be corrupt by the block scanner, and
> unrecoverable since it was the only replica.