[ 
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang updated HDFS-10587:
-----------------------------------
    Description: 
We found that an incorrect offset and length calculation in pipeline recovery 
may cause block corruption and result in missing blocks under a very 
unfortunate scenario. 

(1) A client established a pipeline and started writing data to it.
(2) One of the data nodes in the pipeline restarted, closing the socket, and 
some written data was left unacknowledged.
(3) The client replaced the failed data node with a new one, initiating a 
block transfer to copy the existing data in the block to the new datanode.
(4) The block was transferred to the new node. Crucially, the entire block, 
including the unacknowledged data, was transferred.
(5) The last chunk (512 bytes) was not a full chunk, but the destination still 
reserved the whole chunk in its buffer and wrote the entire buffer to disk, 
so some of the written data was garbage.
(6) When the transfer was done, the destination data node converted the replica 
from temporary to rbw, which set its visible length to the number of bytes on 
disk. That is to say, it treated whatever was transferred as acknowledged. 
However, the visible length of the replica (rounded up to the next multiple of 
512) differs from that of the transfer source. [1]
(7) The client then truncated the block in an attempt to remove the 
unacknowledged data. However, because the visible length was equal to the 
bytes on disk, it did not truncate the unacknowledged data.
(8) When new data was appended to the destination, it skipped the bytes already 
on disk. Therefore, whatever was written as garbage was not replaced.
(9) The volume scanner detected the corrupt replica, but due to HDFS-10512, it 
wouldn’t tell the NameNode to mark the replica as corrupt, so the client 
continued to form a pipeline using the corrupt replica.
(10) Finally, the DN that had the only healthy replica was restarted. The 
NameNode then updated the pipeline to contain only the corrupt replica.
(11) The client continued to write to the corrupt replica, because neither the 
client nor the data node itself knew the replica was corrupt. When the 
restarted datanodes came back, their replicas were stale, even though they 
were not corrupt. Therefore, none of the replicas was both good and up to date.
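
To make steps (5) through (8) concrete, here is a minimal toy model in Java. 
It is not HDFS code: the class ToyReplica, its fields and methods, and the 
1000-byte transfer length are all hypothetical and exist only to illustrate 
the sequence described above, assuming the 512-byte chunk size from step (5).

```java
// Toy model of steps (5)-(8). NOT HDFS code; every name here is hypothetical.
final class ToyReplica {
    static final int CHUNK_SIZE = 512; // chunk size cited in step (5)

    long bytesOnDisk;
    long visibleLength; // bytes this replica believes were acknowledged

    /** Steps (5)-(6): receive a transfer whose last chunk is partial. */
    static ToyReplica receiveTransfer(long transferredBytes) {
        ToyReplica r = new ToyReplica();
        // The whole last chunk buffer is written, so the length rounds up.
        r.bytesOnDisk =
            (transferredBytes + CHUNK_SIZE - 1) / CHUNK_SIZE * CHUNK_SIZE;
        // temporary -> rbw: everything on disk is treated as acknowledged.
        r.visibleLength = r.bytesOnDisk;
        return r;
    }

    /** Step (7): drop bytes beyond what this replica considers acknowledged. */
    long truncateUnacknowledged() {
        long removed = bytesOnDisk - visibleLength;
        bytesOnDisk = visibleLength;
        return removed;
    }

    /** Step (8): an append skips everything that is already on disk. */
    long firstOffsetWrittenByAppend() {
        return bytesOnDisk;
    }

    public static void main(String[] args) {
        long transferred = 1000; // hypothetical length ending mid-chunk
        ToyReplica r = receiveTransfer(transferred);
        System.out.println("bytesOnDisk=" + r.bytesOnDisk);
        System.out.println("truncated=" + r.truncateUnacknowledged());
        System.out.println("appendStartsAt=" + r.firstOffsetWrittenByAppend());
    }
}
```

In this toy run the padding bytes between offset 1000 and the chunk boundary 
are never truncated and never overwritten, which is the garbage that step (8) 
says was not replaced.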

The sequence of events was reconstructed from the DataNode/NameNode logs and 
my understanding of the code.
Incidentally, we have observed the same sequence of events on two independent 
clusters.

[1]
The sender has the replica as follows:
2016-04-15 22:03:05,066 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
  getNumBytes()     = 41381376
  getBytesOnDisk()  = 41381376
  getVisibleLength()= 41186444
  getVolume()       = /hadoop-i/data/current
  getBlockFile()    = /hadoop-i/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
  bytesAcked=41186444
  bytesOnDisk=41381376

while the receiver has the replica as follows:
2016-04-15 22:03:05,068 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
  getNumBytes()     = 41186816
  getBytesOnDisk()  = 41186816
  getVisibleLength()= 41186816
  getVolume()       = /hadoop-g/data/current
  getBlockFile()    = /hadoop-g/data/current/BP-1043567091-10.1.1.1-1343682168507/current/rbw/blk_1556997324
  bytesAcked=41186816
  bytesOnDisk=41186816
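
The numbers in the two log excerpts can be checked against the round-up 
described in step (6). The short Java sketch below does only that arithmetic; 
the class and method names are hypothetical, the 512-byte chunk size is taken 
from step (5), and the lengths are copied from the logs above.

```java
// Arithmetic check on the logged lengths. NOT HDFS code; names are hypothetical.
final class ReplicaLengthCheck {
    /** Rounds length up to the next multiple of chunkSize. */
    static long roundUpToChunk(long length, int chunkSize) {
        return (length + chunkSize - 1) / chunkSize * chunkSize;
    }

    public static void main(String[] args) {
        final int chunkSize = 512;               // from step (5)
        final long senderVisible = 41186444L;    // sender getVisibleLength()
        final long senderOnDisk = 41381376L;     // sender getBytesOnDisk()
        final long receiverVisible = 41186816L;  // receiver getVisibleLength()

        // The receiver's visible length is the sender's, rounded up to 512.
        System.out.println(
            roundUpToChunk(senderVisible, chunkSize) == receiverVisible); // true
        // Bytes the receiver treats as acknowledged beyond the sender's count.
        System.out.println(receiverVisible - senderVisible);              // 372
        // Unacknowledged bytes on the sender's disk.
        System.out.println(senderOnDisk - senderVisible);                 // 194932
    }
}
```

So the receiver counts 372 bytes as acknowledged that the sender never did, 
which matches the rounding to the next multiple of 512.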


> Incorrect offset/length calculation in pipeline recovery causes block 
> corruption
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-10587
>                 URL: https://issues.apache.org/jira/browse/HDFS-10587
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>         Attachments: HDFS-10587.001.patch
>
>


