[
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373875#comment-15373875
]
Yongjun Zhang edited comment on HDFS-10587 at 7/12/16 10:56 PM:
----------------------------------------------------------------
Thanks [~jojochuang] for adding the log to the jira description.
{code}
The sender has the replica as follows:
2016-04-15 22:03:05,066 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
getNumBytes() = 41381376
getBytesOnDisk() = 41381376
getVisibleLength()= 41186444
getVolume() = /hadoop-i/data/current
getBlockFile() = /hadoop-i/data/current/BP-1043567091-10.216.26.120-1343682168507/current/rbw/blk_1556997324
bytesAcked=41186444
bytesOnDisk=41381376
while the receiver has the replica as follows:
2016-04-15 22:03:05,068 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
getNumBytes() = 41186816
getBytesOnDisk() = 41186816
getVisibleLength()= 41186816
getVolume() = /hadoop-g/data/current
getBlockFile() = /hadoop-g/data/current/BP-1043567091-10.216.26.120-1343682168507/current/rbw/blk_1556997324
bytesAcked=41186816
bytesOnDisk=41186816
{code}
The sender's visibleLength is 41186444, which is not a multiple of the 512-byte
chunk size (it's 80442 * 512 + 140), so BlockSender rounds it up to 41186816
(80443 * 512), because there is enough data on the BlockSender DN's disk to
cover the full last chunk.
It would be OK for the BlockReceiver DN to receive 41186816 bytes, as long as
it skips the data it has already received when accepting more data from the
client.
But it appears that the BlockReceiver DN is not doing that correctly. If we can
fix that behavior, it would be a good fix for this issue.
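To make the round-up concrete, here is a minimal sketch of the arithmetic
(hypothetical code, not the actual BlockSender implementation; names such as
roundUpToChunk are illustrative only):
{code}
// Minimal sketch of rounding the visible length up to the next 512-byte
// chunk boundary, capped by what is actually on the sender's disk.
public class ChunkRoundUpSketch {
  static final long CHUNK_SIZE = 512; // bytes per checksum chunk

  static long roundUpToChunk(long len) {
    return ((len + CHUNK_SIZE - 1) / CHUNK_SIZE) * CHUNK_SIZE;
  }

  public static void main(String[] args) {
    long visibleLength = 41186444L; // 80442 * 512 + 140, ends mid-chunk
    long bytesOnDisk   = 41381376L; // sender has more data on disk than acked
    long sendLength = Math.min(roundUpToChunk(visibleLength), bytesOnDisk);
    System.out.println(sendLength); // 41186816 = 80443 * 512
  }
}
{code}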
> Incorrect offset/length calculation in pipeline recovery causes block
> corruption
> --------------------------------------------------------------------------------
>
> Key: HDFS-10587
> URL: https://issues.apache.org/jira/browse/HDFS-10587
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
> Attachments: HDFS-10587.001.patch
>
>
> We found that an incorrect offset and length calculation in pipeline recovery
> may cause block corruption and result in missing blocks under a very
> unfortunate scenario.
> (1) A client established a pipeline and started writing data to it.
> (2) One of the data nodes in the pipeline restarted, closing the socket, and
> some written data was left unacknowledged.
> (3) The client replaced the failed data node with a new one, initiating a
> block transfer to copy the existing data in the block to the new datanode.
> (4) The block was transferred to the new node. Crucially, the entire block,
> including the unacknowledged data, was transferred.
> (5) The last chunk (512 bytes) was not a full chunk, but the destination
> still reserved the whole chunk in its buffer and wrote the entire buffer to
> disk, so some of the written data is garbage.
> (6) When the transfer was done, the destination data node converted the
> replica from temporary to rbw, which set its visible length to the length of
> the bytes on disk. That is to say, it assumed that whatever was transferred
> had been acknowledged. However, the visible length of the replica therefore
> differs from that of the transfer source: it is rounded up to the next
> multiple of 512. [1]
> (7) The client then truncated the block in an attempt to remove the
> unacknowledged data. However, because the visible length was equal to the
> bytes on disk, it did not truncate the unacknowledged data.
> (8) When new data was appended at the destination, it skipped the bytes
> already on disk. Therefore, whatever had been written as garbage was never
> replaced. (See the sketch after the log in [1] below.)
> (9) The volume scanner detected the corrupt replica, but due to HDFS-10512 it
> wouldn’t tell the NameNode to mark the replica as corrupt, so the client
> continued to form a pipeline using the corrupt replica.
> (10) Finally, the DN that had the only healthy replica was restarted. The
> NameNode then updated the pipeline to contain only the corrupt replica.
> (11) The client continued to write to the corrupt replica, because neither
> the client nor the data node itself knew the replica was corrupt. When the
> restarted datanodes came back, their replicas were stale, even though they
> were not corrupt. Therefore, none of the replicas was both good and up to
> date.
> The sequence of events was reconstructed from the DataNode/NameNode logs and
> my understanding of the code.
> Incidentally, we have observed the same sequence of events on two independent
> clusters.
> [1]
> The sender has the replica as follows:
> 2016-04-15 22:03:05,066 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
> getNumBytes() = 41381376
> getBytesOnDisk() = 41381376
> getVisibleLength()= 41186444
> getVolume() = /hadoop-i/data/current
> getBlockFile() = /hadoop-i/data/current/BP-1043567091-10.216.26.120-1343682168507/current/rbw/blk_1556997324
> bytesAcked=41186444
> bytesOnDisk=41381376
> while the receiver has the replica as follows:
> 2016-04-15 22:03:05,068 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_1556997324_1100153495099, RBW
> getNumBytes() = 41186816
> getBytesOnDisk() = 41186816
> getVisibleLength()= 41186816
> getVolume() = /hadoop-g/data/current
> getBlockFile() = /hadoop-g/data/current/BP-1043567091-10.216.26.120-1343682168507/current/rbw/blk_1556997324
> bytesAcked=41186816
> bytesOnDisk=41186816
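To illustrate steps (5) and (8) of the quoted description, here is a minimal,
purely hypothetical sketch (not HDFS code): the destination flushes a whole
512-byte chunk buffer even though only part of it holds real data, and the
later append skips everything already on disk, so the padded tail is never
overwritten:
{code}
import java.util.Arrays;

// Hypothetical sketch: a partial last chunk is written as a full 512-byte
// buffer (step 5), and the subsequent append skips bytes already on disk
// (step 8), so the garbage tail is never replaced with real data.
public class PartialChunkSketch {
  static final int CHUNK = 512;

  public static void main(String[] args) {
    // Step (5): only 140 bytes of the last chunk are real, but the whole
    // reserved buffer is flushed; bytes 140..511 are garbage (zeros here).
    byte[] onDisk = new byte[CHUNK];
    Arrays.fill(onDisk, 0, 140, (byte) 1);

    // Step (8): the client sends the real data for that chunk, but the
    // receiver thinks the full chunk is already on disk and skips it all.
    byte[] clientChunk = new byte[CHUNK];
    Arrays.fill(clientChunk, (byte) 1);
    int bytesAlreadyOnDisk = CHUNK;
    int bytesWritten = CHUNK - Math.min(bytesAlreadyOnDisk, CHUNK);
    System.out.println("bytes overwritten: " + bytesWritten);   // 0
    System.out.println("replica matches client data: "
        + Arrays.equals(onDisk, clientChunk));                  // false
  }
}
{code}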