[ 
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15376897#comment-15376897
 ] 

xupeng commented on HDFS-10587:
-------------------------------

Hi [~vinayrpet]:

The related logs are listed below:

134.228
{noformat}
2016-07-13 11:48:29,528 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DataTransfer: Transmitted blk_1116167880_42905642 (numBytes=9911790) to 
/10.6.134.229:5080
2016-07-13 11:48:29,552 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving blk_1116167880_42905642 src: /10.6.130.44:26319 dest: 
/10.6.134.228:5080
2016-07-13 11:48:29,552 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
RBW replica blk_1116167880_42905642
2016-07-13 11:48:29,552 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering 
ReplicaBeingWritten, blk_1116167880_42905642, RBW
  getNumBytes()     = 9912487
  getBytesOnDisk()  = 9912487
  getVisibleLength()= 9911790
  getVolume()       = /current
  getBlockFile()    = /current/current/rbw/blk_1116167880
  bytesAcked=9911790
  bytesOnDisk=9912487
2016-07-13 11:48:29,552 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
truncateBlock: blockFile=/current/current/rbw/blk_1116167880, 
metaFile=/current/current/rbw/blk_1116167880_42905642.meta, oldlen=9912487, 
newlen=9911790
2016-07-13 11:49:01,566 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving blk_1116167880_42906656 src: /10.6.130.44:26617 dest: 
/10.6.134.228:5080
2016-07-13 11:49:01,566 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
RBW replica blk_1116167880_42906656
2016-07-13 11:49:01,566 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering 
ReplicaBeingWritten, blk_1116167880_42906656, RBW
  getNumBytes()     = 15104963
  getBytesOnDisk()  = 15104963
  getVisibleLength()= 15102415
  getVolume()       = /current
  getBlockFile()    = /current/current/rbw/blk_1116167880
  bytesAcked=15102415
  bytesOnDisk=15104963
2016-07-13 11:49:01,566 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
truncateBlock: blockFile=/current/rbw/blk_1116167880, 
metaFile=/current/rbw/blk_1116167880_42906656.meta, oldlen=15104963, 
newlen=15102415
2016-07-13 11:49:01,569 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Datanode 2 got response for connect ack  from downstream datanode with 
firstbadlink as 10.6.129.77:5080
2016-07-13 11:49:01,569 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Datanode 2 forwarding connect ack to upstream firstbadlink is 10.6.129.77:5080
2016-07-13 11:49:01,570 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder: blk_1116167880_42907145, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.EOFException: Premature EOF: no length prefix available
        at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2225)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1179)
        at java.lang.Thread.run(Thread.java:745)
2016-07-13 11:49:01,570 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception for blk_1116167880_42907145
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
{noformat}


134.229
{noformat}
2016-07-13 11:48:29,488 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving blk_1116167880_42905642 src: /10.6.134.228:24286 dest: 
/10.6.134.229:5080
2016-07-13 11:48:29,516 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Convert 
blk_1116167880_42905642 from Temporary to RBW, visible length=9912320
2016-07-13 11:48:29,552 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving blk_1116167880_42905642 src: /10.6.134.228:24321 dest: 
/10.6.134.229:5080
2016-07-13 11:48:29,552 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
RBW replica blk_1116167880_42905642
2016-07-13 11:48:29,552 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering 
ReplicaBeingWritten, blk_1116167880_42905642, RBW
  getNumBytes()     = 9912320
  getBytesOnDisk()  = 9912320
  getVisibleLength()= 9912320
  getVolume()       = /current
  getBlockFile()    = /current/rbw/blk_1116167880
  bytesAcked=9912320
  bytesOnDisk=9912320

2016-07-13 11:49:01,501 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder: blk_1116167880_42906656, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.IOException: Connection reset by peer
2016-07-13 11:49:01,505 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder: blk_1116167880_42906656, type=HAS_DOWNSTREAM_IN_PIPELINE 
terminating
2016-07-13 11:49:01,505 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
opWriteBlock blk_1116167880_42906656 received exception java.io.IOException: 
Premature EOF from inputStream
2016-07-13 11:49:01,506 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
DataXceiver error processing WRITE_BLOCK operation  src: /10.6.134.228:24321 
dst: /10.6.134.229:5080
java.io.IOException: Premature EOF from inputStream
        at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:789)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:917)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:174)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:80)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:244)
        at java.lang.Thread.run(Thread.java:745)
2016-07-13 11:49:01,541 INFO 
org.apache.hadoop.hdfs.server.datanode.BlockScanner: Not scanning suspicious 
block blk_1116167880_42906656 on DS-8c209fca-9b34-4a6b-919b-6b4d24a3e13a, 
because the block scanner is disabled.
2016-07-13 11:49:01,542 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
:Failed to transfer blk_1116167880_42906656 to 10.6.129.77:5080 got
java.net.SocketException: Original Exception : java.io.IOException: Connection 
reset by peer
        at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
        at 
sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
        at 
sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
        at 
org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:576)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:745)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2166)
        at 
org.apache.hadoop.hdfs.server.datanode.DataNode.transferReplicaForPipelineRecovery(DataNode.java:2895)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.transferBlock(DataXceiver.java:980)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opTransferBlock(Receiver.java:203)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:92)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:244)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
        ... 14 more
2016-07-13 11:49:01,567 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving blk_1116167880_42906656 src: /10.6.134.228:46441 dest: /10.6.134.229:5080
2016-07-13 11:49:01,567 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
RBW replica blk_1116167880_42906656
2016-07-13 11:49:01,567 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering 
ReplicaBeingWritten, blk_1116167880_42906656, RBW
  getNumBytes()     = 15104963
  getBytesOnDisk()  = 15104963
  getVisibleLength()= 15102415
  getVolume()       = /current
  getBlockFile()    = /current/rbw/blk_1116167880
  bytesAcked=15102415
  bytesOnDisk=15104963
2016-07-13 11:49:01,567 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
truncateBlock: blockFile=/current/rbw/blk_1116167880, 
metaFile=current/rbw/blk_1116167880_42906656.meta, oldlen=15104963, 
newlen=15102415
2016-07-13 11:49:01,569 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
:Exception transfering block blk_1116167880_42907145 to mirror 
10.6.129.77:5080: java.io.EOFException: Premature EOF: no length prefix 
available
2016-07-13 11:49:01,569 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
opWriteBlock blk_1116167880_42907145 received exception java.io.EOFException: 
Premature EOF: no length prefix available
2016-07-13 11:49:01,569 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
DataXceiver error processing WRITE_BLOCK operation  src: /10.6.134.228:46441 
dst: /10.6.134.229:5080
java.io.EOFException: Premature EOF: no length prefix available
        at 
org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2225)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:858)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:174)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:80)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:244)
        at java.lang.Thread.run(Thread.java:745)
{noformat}

129.77
{noformat}
2016-07-13 11:49:01,512 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving blk_1116167880_42906656 src: /10.6.134.229:43844 dest: 
/10.6.129.77:5080
2016-07-13 11:49:01,543 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
Checksum error in block blk_1116167880_42906656 from /10.6.134.229:43844
org.apache.hadoop.fs.ChecksumException: Checksum error: 
DFSClient_NONMAPREDUCE_2019484565_1 at 81920 exp: 1352119728 got: -1012279895
        at 
org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray(Native 
Method)
        at 
org.apache.hadoop.util.NativeCrc32.verifyChunkedSumsByteArray(NativeCrc32.java:69)
        at 
org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:347)
        at 
org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:294)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:421)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:558)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:789)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:917)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:174)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:80)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:244)
        at java.lang.Thread.run(Thread.java:745)
2016-07-13 11:49:01,543 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception for blk_1116167880_42906656
java.io.IOException: Terminating due to a checksum error.java.io.IOException: 
Unexpected checksum mismatch while writing blk_1116167880_42906656 from 
/10.6.134.229:43844
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:571)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:789)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:917)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:174)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:80)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:244)
        at java.lang.Thread.run(Thread.java:745)
2016-07-13 11:49:01,543 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
opWriteBlock blk_1116167880_42906656 received exception java.io.IOException: 
Terminating due to a checksum error.java.io.IOException: Unexpected checksum 
mismatch while writing blk_1116167880_42906656 from /10.6.134.229:43844
2016-07-13 11:49:01,543 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:  
error processing WRITE_BLOCK operation  src: /10.6.134.229:43844 dst: 
/10.6.129.77:5080
java.io.IOException: Terminating due to a checksum error.java.io.IOException: 
Unexpected checksum mismatch while writing blk_1116167880_42906656 from 
/10.6.134.229:43844
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:571)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:789)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:917)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:174)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:80)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:244)
        at java.lang.Thread.run(Thread.java:745)
2016-07-13 11:49:01,570 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving blk_1116167880_42906656 src: /10.6.134.229:43878 dest: 
/10.6.129.77:5080
2016-07-13 11:49:01,570 INFO 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover 
RBW replica BP-448958278-10.6.130.96-1457941856632:blk_1116167880_42906656
2016-07-13 11:49:01,570 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
opWriteBlock blk_1116167880_42906656 received exception 
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Cannot recover 
a non-RBW replica ReplicaInPipeline, blk_1116167880_42906656, TEMPORARY
  getNumBytes()     = 9961472
  getBytesOnDisk()  = 9830400
  getVisibleLength()= -1
  getVolume()       = /current
  getBlockFile()    = /current/tmp/blk_1116167880
  bytesAcked=0
  bytesOnDisk=9830400
2016-07-13 11:49:01,571 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
DataXceiver error processing WRITE_BLOCK operation  src: /10.6.134.229:43878 
dst: /10.6.129.77:5080
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Cannot recover 
a non-RBW replica ReplicaInPipeline, blk_1116167880_42906656, TEMPORARY
  getNumBytes()     = 9961472
  getBytesOnDisk()  = 9830400
  getVisibleLength()= -1
  getVolume()       = /current
  getBlockFile()    = /current/tmp/blk_1116167880
  bytesAcked=0
  bytesOnDisk=9830400
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.recoverRbw(FsDatasetImpl.java:1273)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:190)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:799)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:174)
        at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:80)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:244)
        at java.lang.Thread.run(Thread.java:745)
{noformat}

> Incorrect offset/length calculation in pipeline recovery causes block 
> corruption
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-10587
>                 URL: https://issues.apache.org/jira/browse/HDFS-10587
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>         Attachments: HDFS-10587.001.patch
>
>
> We found that an incorrect offset and length calculation in pipeline recovery 
> may cause block corruption and result in missing blocks under a very 
> unfortunate scenario. 
> (1) A client established a pipeline and started writing data to it.
> (2) One of the datanodes in the pipeline restarted, closing the socket, and 
> some of the written data was left unacknowledged.
> (3) The client replaced the failed datanode with a new one, initiating a block 
> transfer to copy the existing data in the block to the new datanode.
> (4) The block was transferred to the new node. Crucially, the entire block, 
> including the unacknowledged data, was transferred.
> (5) The last chunk (512 bytes) was not a full chunk, but the destination 
> still reserved the whole chunk in its buffer and wrote the entire buffer to 
> disk; therefore some of the written data was garbage.
> (6) When the transfer was done, the destination datanode converted the 
> replica from temporary to RBW, which made its visible length equal to the 
> number of bytes on disk. That is to say, it treated whatever was transferred 
> as acknowledged. However, the visible length of the replica differed (rounded 
> up to the next multiple of 512) from that of the source of the transfer. [1]
> (7) The client then truncated the block in an attempt to remove the 
> unacknowledged data. However, because the visible length was equal to the 
> bytes on disk, the truncation did not remove the unacknowledged data.
> (8) When new data was appended to the destination, it skipped the bytes 
> already on disk. Therefore, whatever was written as garbage was not replaced.
> (9) The volume scanner detected the corrupt replica, but due to HDFS-10512 it 
> would not tell the NameNode to mark the replica as corrupt, so the client 
> continued to form a pipeline using the corrupt replica.
> (10) Finally, the DN that had the only healthy replica was restarted. The 
> NameNode then updated the pipeline to contain only the corrupt replica.
> (11) The client continued to write to the corrupt replica, because neither the 
> client nor the datanode itself knew the replica was corrupt. When the 
> restarted datanodes came back, their replicas were stale, even though they 
> were not corrupt. Therefore, none of the replicas was both good and up to date.
> The sequence of events was reconstructed based on the DataNode/NameNode logs 
> and my understanding of the code.
> Incidentally, we have observed the same sequence of events on two independent 
> clusters.
> [1]
> The sender has the replica as follows:
> 2016-04-15 22:03:05,066 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
>   getNumBytes()     = 41381376
>   getBytesOnDisk()  = 41381376
>   getVisibleLength()= 41186444
>   getVolume()       = /hadoop-i/data/current
>   getBlockFile()    = 
> /hadoop-i/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
>   bytesAcked=41186444
>   bytesOnDisk=41381376
> while the receiver has the replica as follows:
> 2016-04-15 22:03:05,068 INFO 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: 
> Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
>   getNumBytes()     = 41186816
>   getBytesOnDisk()  = 41186816
>   getVisibleLength()= 41186816
>   getVolume()       = /hadoop-g/data/current
>   getBlockFile()    = 
> /hadoop-g/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
>   bytesAcked=41186816
>   bytesOnDisk=41186816
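
For reference, the receiver's length in [1] is exactly the sender's acknowledged 
length rounded up to the next 512-byte chunk boundary: roundUp(41186444, 512) = 
41186816, i.e. 372 bytes of padding beyond the acknowledged data. The sketch 
below is a minimal, hypothetical illustration of the arithmetic in steps (5)-(7); 
it is not the actual HDFS code, and the names (roundUpToChunk, CHUNK_SIZE, the 
local variables) are made up for the example.

{code:java}
/**
 * Minimal, hypothetical sketch of the length calculation described in
 * steps (5)-(7), using the numbers from [1]. Not the actual HDFS code;
 * all names here are invented for illustration.
 */
public class PipelineRecoveryLengthSketch {

  static final long CHUNK_SIZE = 512; // bytesPerChecksum

  /** Round a length up to the next chunk boundary. */
  static long roundUpToChunk(long len) {
    return ((len + CHUNK_SIZE - 1) / CHUNK_SIZE) * CHUNK_SIZE;
  }

  public static void main(String[] args) {
    // Sender's replica from [1]: only bytesAcked are known-good data.
    long senderBytesAcked  = 41186444L; // getVisibleLength() on the sender
    long senderBytesOnDisk = 41381376L; // includes unacknowledged data

    // Step (5): the destination writes whole chunks, so the last partial
    // chunk is padded out to a full 512 bytes.
    long receiverBytesOnDisk = roundUpToChunk(senderBytesAcked);
    long garbageBytes = receiverBytesOnDisk - senderBytesAcked;
    System.out.println("receiver bytesOnDisk = " + receiverBytesOnDisk); // 41186816
    System.out.println("garbage pad bytes    = " + garbageBytes);        // 372

    // Step (6): converting temporary -> RBW makes everything on disk look
    // acknowledged on the receiver.
    long receiverBytesAcked = receiverBytesOnDisk;

    // Step (7): truncating down to bytesAcked is then a no-op, so the pad
    // bytes survive, and step (8) appends new data after them.
    boolean truncationRemovesGarbage = receiverBytesAcked < receiverBytesOnDisk;
    System.out.println("truncation removes garbage? " + truncationRemovesGarbage); // false
  }
}
{code}

The sketch only restates the arithmetic already implied by steps (5)-(7): the 
round-up makes bytesAcked and bytesOnDisk coincide on the receiver, which is 
exactly the condition under which the later truncate to bytesAcked cannot 
remove the garbage pad.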



