Rushabh S Shah created HDFS-9558:
------------------------------------
Summary: Replication requests always blame the source datanode in
case of a ChecksumException.
Key: HDFS-9558
URL: https://issues.apache.org/jira/browse/HDFS-9558
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Reporter: Rushabh S Shah
Replication requests from a datanode (e.g. during a rack-failure event) always
blame the source datanode if any of the downstream nodes encounters a
ChecksumException.
We saw this case recently in our cluster.
We lost 7 nodes in a rack, leaving only one replica of the block (say on dnA).
The namenode asked dnA to replicate the block to dnB and dnC.
{noformat}
2015-12-13 21:09:41,798 [DataNode: heartbeating to NN:8020] INFO datanode.DataNode: DatanodeRegistration(dnA, datanodeUuid=bc1f183d-b74a-49c9-ab1a-d1d496ab77e9, infoPort=1006, infoSecurePort=0, ipcPort=8020, storageInfo=lv=-56;cid=CID-e7f736ac-158e-446e-9091-7e66f3cddf3c;nsid=358250775;c=1428471998571) Starting thread to transfer BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617 to dnB:1004 dnC:1004
{noformat}
All the packets going out of dnB's network interface were getting corrupted.
So dnC received a corrupt block and reported a bad block to the namenode,
attributing the corruption to dnA.
Following are the logs from dnC:
{noformat}
2015-12-13 21:09:43,444 [DataXceiver for client at /dnB:34879 [Receiving block BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617]] WARN datanode.DataNode: Checksum error in block BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617 from /dnB:34879
org.apache.hadoop.fs.ChecksumException: Checksum error: at 58368 exp: -1657951272 got: 856104973
        at org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray(Native Method)
        at org.apache.hadoop.util.NativeCrc32.verifyChunkedSumsByteArray(NativeCrc32.java:69)
        at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:347)
        at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:294)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:416)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:550)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:853)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
        at java.lang.Thread.run(Thread.java:745)
2015-12-13 21:09:43,445 [DataXceiver for client at dnB:34879 [Receiving block BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617]] INFO datanode.DataNode: report corrupt BP-1620678153-XXXX-1351096255769:blk_3065507810_1107476861617 from datanode dnA:1004 to namenode
{noformat}
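The root of the mis-attribution: the receiver only recomputes the checksum over the bytes it got and compares against the checksum shipped along with the data. A mismatch proves the data is corrupt, but says nothing about which hop corrupted it. A minimal, self-contained sketch of that limitation, using plain {{java.util.zip.CRC32}} rather than Hadoop's {{DataChecksum}} (the names and scenario here are illustrative only, not Hadoop code):
{noformat}
import java.util.zip.CRC32;

public class ChecksumDemo {
    // Compute a CRC32 checksum over a chunk of bytes,
    // analogous to the per-chunk checksum stored with a block.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] chunk = "block data from dnA".getBytes();
        long expected = checksum(chunk);   // computed at the source (dnA)

        // Simulate corruption in transit (e.g. dnB's faulty interface).
        byte[] received = chunk.clone();
        received[5] ^= 0x01;               // flip one bit on the wire

        long got = checksum(received);     // recomputed at the receiver (dnC)
        System.out.println(expected == got ? "ok" : "checksum mismatch");
        // prints "checksum mismatch" -- the receiver knows THAT the data is
        // corrupt, but not WHERE in the pipeline it was corrupted, so blaming
        // the source datanode is only a guess.
    }
}
{noformat}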
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)