[ https://issues.apache.org/jira/browse/HDFS-6937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15313007#comment-15313007 ]
Wei-Chiu Chuang commented on HDFS-6937:
---------------------------------------

I am taking over Yongjun's patch because he will not have Internet access for some time. This is great work, and it took me some time to understand it.

I think that instead of throwing an IOException to simulate the injection of a checksum failure at the last datanode, it should enqueue an ERROR_CHECKSUM to indicate the checksum failure. Without it, the last DN will shut down the connection, and the second DN in the pipeline will not know that the cause was a checksum failure.

{code:title=BlockReceiver.java#sendAckUpstreamUnprotected}
if (ack == null) {
  // A new OOB response is being sent from this node. Regardless of
  // downstream nodes, reply should contain one reply.
  replies = new int[] { myHeader };
} else if (mirrorError) { // ack read error
  int h = PipelineAck.combineHeader(datanode.getECN(), Status.SUCCESS);
  int h1 = PipelineAck.combineHeader(datanode.getECN(), Status.ERROR);
  replies = new int[] {h, h1};
} else {
  short ackLen = type == PacketResponderType.LAST_IN_PIPELINE ? 0 : ack
      .getNumOfReplies();
  replies = new int[ackLen + 1];
  replies[0] = myHeader;
  for (int i = 0; i < ackLen; ++i) {
    replies[i + 1] = ack.getHeaderFlag(i);
  }
  // If the mirror has reported that it received a corrupt packet,
  // do self-destruct to mark myself bad, instead of making the
  // mirror node bad. The mirror is guaranteed to be good without
  // corrupt data on disk.
  if (ackLen > 0 && PipelineAck.getStatusFromHeader(replies[1]) ==
      Status.ERROR_CHECKSUM) {
    throw new IOException("Shutting down writer and responder "
        + "since the down streams reported the data sent by this "
        + "thread is corrupt");
  }
}
{code}

In this piece of code, if the next DN shuts down the connection, the local DN is always assumed to be good: it reports SUCCESS for itself and ERROR for its mirror.
{code}
int h = PipelineAck.combineHeader(datanode.getECN(), Status.SUCCESS);
int h1 = PipelineAck.combineHeader(datanode.getECN(), Status.ERROR);
replies = new int[] {h, h1};
{code}

On the other hand, if the next DN responds with an ERROR_CHECKSUM, the local DN throws an IOException, which shuts down its connection with the previous DN in the pipeline. In the end, this causes the middle datanode to be replaced:

{code:title=DataStreamer.java#createBlockOutputStream}
// find the datanode that matches
if (firstBadLink.length() != 0) {
  for (int i = 0; i < nodes.length; i++) {
    // NB: Unconditionally using the xfer addr w/o hostname
    if (firstBadLink.equals(nodes[i].getXferAddr())) {
      errorState.setBadNodeIndex(i);
      break;
    }
  }
}
{code}

> Another issue in handling checksum errors in write pipeline
> -----------------------------------------------------------
>
>                 Key: HDFS-6937
>                 URL: https://issues.apache.org/jira/browse/HDFS-6937
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs-client
>    Affects Versions: 2.5.0
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>         Attachments: HDFS-6937.001.patch, HDFS-6937.002.patch
>
>
> Given a write pipeline:
> DN1 -> DN2 -> DN3
> DN3 detected a checksum error and terminated; DN2 truncates its replica to the
> ACKed size. Then a new pipeline is attempted as
> DN1 -> DN2 -> DN4
> DN4 detects a checksum error again. Later, when DN4 was replaced with DN5 (and
> so on), the write failed for the same reason. This led to the observation that
> DN2's data is corrupted.
> Found that the software currently truncates DN2's replica to the ACKed size
> after DN3 terminates, but it doesn't check the correctness of the data
> already written to disk.
> So intuitively, a solution would be: when the downstream DN (DN3 here) finds a
> checksum error, it propagates this info back to the upstream DN (DN2 here); DN2
> checks the correctness of the data already written to disk and truncates the
> replica to MIN(correctDataSize, ACKedSize).
> Found that this issue is similar to what was reported in HDFS-3875, and the
> truncation at DN2 was actually introduced as part of the HDFS-3875 solution.
> Filing this jira for the issue reported here. HDFS-3875 was filed by
> [~tlipcon], and he proposed something similar there:
> {quote}
> if the tail node in the pipeline detects a checksum error, then it returns a
> special error code back up the pipeline indicating this (rather than just
> disconnecting)
> if a non-tail node receives this error code, then it immediately scans its
> own block on disk (from the beginning up through the last acked length). If
> it detects a corruption on its local copy, then it should assume that it is
> the faulty one, rather than the downstream neighbor. If it detects no
> corruption, then the faulty node is either the downstream mirror or the
> network link between the two, and the current behavior is reasonable.
> {quote}
> Thanks.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
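
For illustration only, the truncation rule from the issue description (truncate the replica to MIN(correctDataSize, ACKedSize)) could be sketched roughly as below. This is not code from any HDFS-6937 patch. The class name, the method names, the 512-byte chunk size, and the use of java.util.zip.CRC32 over an in-memory byte array are all assumptions made to keep the example self-contained; the real datanode reads block and meta files and may use a different checksum type.

```java
import java.util.zip.CRC32;

// Hypothetical sketch: none of these names exist in HDFS.
public class ReplicaTruncationSketch {
    // Assumed chunk size: one stored checksum covers this many data bytes.
    static final int BYTES_PER_CHECKSUM = 512;

    // CRC32 of data[off .. off+len).
    static long checksumOf(byte[] data, int off, int len) {
        CRC32 crc = new CRC32();
        crc.update(data, off, len);
        return crc.getValue();
    }

    // Scan the on-disk bytes chunk by chunk from the beginning and return
    // the length of the prefix whose chunks all match their stored checksums.
    static long correctDataSize(byte[] onDisk, long[] storedChecksums) {
        int verified = 0;
        for (int chunk = 0;
             verified < onDisk.length && chunk < storedChecksums.length;
             chunk++) {
            int len = Math.min(BYTES_PER_CHECKSUM, onDisk.length - verified);
            if (checksumOf(onDisk, verified, len) != storedChecksums[chunk]) {
                break; // first corrupt chunk: stop, nothing after it is trusted
            }
            verified += len;
        }
        return verified;
    }

    // The rule from the description: MIN(correctDataSize, ACKedSize).
    static long truncationLength(byte[] onDisk, long[] storedChecksums,
                                 long ackedSize) {
        return Math.min(correctDataSize(onDisk, storedChecksums), ackedSize);
    }

    public static void main(String[] args) {
        byte[] data = new byte[3 * BYTES_PER_CHECKSUM];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        long[] sums = new long[3];
        for (int c = 0; c < sums.length; c++) {
            sums[c] = checksumOf(data, c * BYTES_PER_CHECKSUM, BYTES_PER_CHECKSUM);
        }
        data[700] ^= 0x01; // simulate on-disk corruption in the second chunk
        // ACKed size is 1200, but only the first chunk verifies.
        System.out.println(truncationLength(data, sums, 1200));
    }
}
```

With an ACKed size of 1200 and corruption in the second chunk, the sketch keeps only the first 512 bytes, whereas truncating to the ACKed size alone would leave the corrupt bytes in the replica.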