[ https://issues.apache.org/jira/browse/HDFS-9558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du updated HDFS-9558:
-----------------------------
    Target Version/s:   (was: 2.8.0)

> Replication requests from datanode always blames the source datanode in case
> of Checksum Exception.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9558
>                 URL: https://issues.apache.org/jira/browse/HDFS-9558
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: Rushabh S Shah
>
> Replication requests from a datanode (in case of a rack failure event) always
> blame the source datanode if any of the downstream nodes encounters a
> ChecksumException.
> We saw this case recently in our cluster.
> We lost 7 nodes in a rack.
> There was only one replica of the block (say on dnA).
> The namenode asks dnA to replicate to dnB and dnC.
> {noformat}
> 2015-12-13 21:09:41,798 [DataNode:   heartbeating to NN:8020] INFO datanode.DataNode: DatanodeRegistration(dnA, datanodeUuid=bc1f183d-b74a-49c9-ab1a-d1d496ab77e9, infoPort=1006, infoSecurePort=0, ipcPort=8020, storageInfo=lv=-56;cid=CID-e7f736ac-158e-446e-9091-7e66f3cddf3c;nsid=358250775;c=1428471998571) Starting thread to transfer BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617 to dnB:1004 dnC:1004
> {noformat}
> All the packets going out from dnB's interface were getting corrupted.
> So dnC received a corrupt block and reported a bad block (from dnA) to the
> namenode.
> Following are the logs from dnC:
> {noformat}
> 2015-12-13 21:09:43,444 [DataXceiver for client  at /dnB:34879 [Receiving block BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617]] WARN datanode.DataNode: Checksum error in block BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617 from /dnB:34879
> org.apache.hadoop.fs.ChecksumException: Checksum error:  at 58368 exp: -1657951272 got: 856104973
>         at org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray(Native Method)
>         at org.apache.hadoop.util.NativeCrc32.verifyChunkedSumsByteArray(NativeCrc32.java:69)
>         at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:347)
>         at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:294)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:416)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:550)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:853)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
>         at java.lang.Thread.run(Thread.java:745)
> 2015-12-13 21:09:43,445 [DataXceiver for client  at dnB:34879 [Receiving block BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617]] INFO datanode.DataNode: report corrupt BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617 from datanode dnA:1004 to namenode
> {noformat}
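
The failure mode described above can be illustrated with a small, self-contained model. This is only a sketch under simplifying assumptions, not Hadoop code: the class ChecksumBlameSketch and its methods (checksum, verify, deliver, blamedNode) are hypothetical names made up for illustration, and java.util.zip.CRC32 merely stands in for the datanode's chunk checksum. It shows that the last node in the pipeline sees the same checksum mismatch whether the corruption came from dnA or from dnB, so a rule that always reports the pipeline's source names dnA in both cases.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical model of the scenario in this report; not part of Hadoop. */
class ChecksumBlameSketch {

    /** CRC32 of a chunk, standing in for the checksum that travels with the data. */
    static long checksum(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        return crc.getValue();
    }

    /** True when the received bytes still match the expected checksum. */
    static boolean verify(byte[] received, long expected) {
        return checksum(received) == expected;
    }

    /**
     * Passes the data along the pipeline and returns what the last node receives.
     * Every node except the last forwards the data; the node named faultyNode
     * (null for none) flips one bit in what it sends out. Assumes non-empty data.
     */
    static byte[] deliver(byte[] data, String[] pipeline, String faultyNode) {
        byte[] inFlight = data.clone();
        for (int i = 0; i < pipeline.length - 1; i++) {
            if (pipeline[i].equals(faultyNode)) {
                inFlight[0] ^= 0x01; // simulate corruption on this node's outgoing side
            }
        }
        return inFlight;
    }

    /** Models the behaviour described in the report: the pipeline source is always reported. */
    static String blamedNode(String[] pipeline) {
        return pipeline[0];
    }

    public static void main(String[] args) {
        String[] pipeline = {"dnA", "dnB", "dnC"};
        byte[] block = "block data".getBytes(StandardCharsets.UTF_8);
        long expected = checksum(block);

        for (String faulty : new String[] {"dnA", "dnB"}) {
            boolean ok = verify(deliver(block, pipeline, faulty), expected);
            System.out.println("faulty=" + faulty
                    + " checksumOkAtDnC=" + ok
                    + " reported=" + blamedNode(pipeline));
        }
    }
}
```

In this model the check at dnC fails in both runs and the reported node is dnA in both, even though only the first run actually involves a fault on dnA. With a single replica left, as in the incident above, that distinction matters.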