[
https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368564#comment-15368564
]
Yongjun Zhang commented on HDFS-10587:
--------------------------------------
Hi [~jojochuang],
Thanks a lot for the investigation/findings and the jira!
I did some study and below is what I found:
The acknowledged length is the number of bytes that have been acknowledged to the client (the
writer). After the new pipeline is constructed, the client continues writing data from that
point onward.
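Just to make the terminology concrete, a tiny sketch with made-up numbers (they are not from this incident):
{code}
// Hypothetical numbers for illustration only.
long bytesSentByClient = 2048; // bytes the writer has pushed into the pipeline
long bytesAcked = 1536;        // bytes acknowledged back to the writer so far
// After the new pipeline is constructed, the writer resumes from the
// acknowledged length; the 512 unacknowledged bytes are re-sent.
long resumeOffset = bytesAcked;
{code}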
{quote}
(5) The last chunk (512 bytes) was not a full chunk, but the destination still
reserved the whole chunk in its buffer, and wrote the entire buffer to disk,
therefore some written data is garbage.
{quote}
The code that does the padding is in BlockSender's constructor:
{code}
// end is either last byte on disk or the length for which we have a
// checksum
long end = chunkChecksum != null ? chunkChecksum.getDataLength()
    : replica.getBytesOnDisk();
if (startOffset < 0 || startOffset > end
    || (length + startOffset) > end) {
  String msg = " Offset " + startOffset + " and length " + length
      + " don't match block " + block + " ( blockLen " + end + " )";
  LOG.warn(datanode.getDNRegistrationForBP(block.getBlockPoolId()) +
      ":sendBlock() : " + msg);
  throw new IOException(msg);
}

// Ensure read offset is position at the beginning of chunk
offset = startOffset - (startOffset % chunkSize);
if (length >= 0) {
  // Ensure endOffset points to end of chunk.
  long tmpLen = startOffset + length;
  if (tmpLen % chunkSize != 0) {
    tmpLen += (chunkSize - tmpLen % chunkSize);   <===== include data to end of chunk
  }
  if (tmpLen < end) {
    // will use on-disk checksum here since the end is a stable chunk
    end = tmpLen;
  } else if (chunkChecksum != null) {
    // last chunk is changing. flag that we need to use in-memory checksum
    this.lastChunkChecksum = chunkChecksum;
  }
}
endOffset = end;
{code}
{{endOffset}} is assigned from {{end}}, and {{end}} is only overwritten by
{{tmpLen}} when {{tmpLen < end}}.
Notice that, per the comment in the code above, {{end}} is "either last byte on disk or the
length for which we have a checksum".
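To make the rounding concrete, here is a small worked example with hypothetical numbers (chunk size 512; none of these values are from the actual incident):
{code}
// Hypothetical values for illustration only.
long chunkSize = 512;
long startOffset = 0;
long length = 1000;   // the length requested for the transfer (visible length)
long end = 1024;      // bytesOnDisk / checksummed length on the source

long tmpLen = startOffset + length;            // 1000
if (tmpLen % chunkSize != 0) {
  tmpLen += (chunkSize - tmpLen % chunkSize);  // rounded up to 1024
}
// tmpLen (1024) is not < end (1024), so end stays 1024 and endOffset = 1024:
// the sender ships 24 bytes beyond the requested length, but those bytes are
// still valid, checksummed data that exists on the source's disk.
{code}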
So in theory, the data sent from BlockSender is still valid data. Thus the
following statement is not true:
{quote}
(5) The last chunk (512 bytes) was not a full chunk, but the destination still
reserved the whole chunk in its buffer, and wrote the entire buffer to disk,
therefore some written data is garbage.
{quote}
That said, it's important for the receiving DN to have the accurate visibleLength, and that's
the key issue here: the receiving DN got the wrong visibleLength.
The way the receiver gets the visible length is:
{code}
@Override // FsDatasetSpi
public synchronized ReplicaInPipeline convertTemporaryToRbw(
    final ExtendedBlock b) throws IOException {
  final long blockId = b.getBlockId();
  final long expectedGs = b.getGenerationStamp();
  final long visible = b.getNumBytes();   <===== This is how the receiving DN got the visible length
  LOG.info("Convert " + b + " from Temporary to RBW, visible length="
      + visible);
{code}
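Continuing the hypothetical numbers from the example above, here is a sketch of how the mismatch shows up at the destination (the variable names are mine, just for illustration):
{code}
// Hypothetical values, continuing the example above.
long ackedLengthAtClient = 1000; // what the writer believes is acknowledged
long bytesTransferred = 1024;    // rounded up to the chunk boundary by the sender

// Per the issue description, the destination's temporary replica is built from
// the transferred bytes, so the visible length it ends up reporting after the
// conversion to RBW tracks 1024, not the writer's 1000.
long visibleAtDestination = bytesTransferred;

// The client's later truncation to the acknowledged length (1000) then has no
// effect, and appended data skips the bytes already on disk, which is how the
// corruption described in the issue arises.
{code}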
So I think the solution is for the sender to send only visibleLength worth of data, instead of
including data up to the chunk end.
Quoted below is part of the code shown above:
{code}
// Ensure endOffset points to end of chunk.
long tmpLen = startOffset + length;
if (tmpLen % chunkSize != 0) {
  tmpLen += (chunkSize - tmpLen % chunkSize);   <===== include data to end of chunk
}
{code}
Though the comment here says {{// Ensure endOffset points to end of chunk.}}, I
don't see why we need to do that. If we instead ensure that endOffset equals
startOffset + visibleLength (the {{length}} here is the visible length), then we
would fix the incorrect visibleLength at the receiver side, and thus the
corruption issue.
This seems to be the solution, unless there is some subtle reason that we really
have to {{Ensure endOffset points to end of chunk.}}.
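To illustrate the idea, a rough sketch against the BlockSender snippet above (this is only a sketch, not a tested patch, and whether the chunk alignment can safely be dropped for all callers is exactly the open question):
{code}
// Illustrative sketch only -- not a tested patch.
if (length >= 0) {
  // Idea: do not round tmpLen up to the chunk boundary; cap the transfer at
  // startOffset + length (the visible length requested by the caller).
  long tmpLen = startOffset + length;
  if (tmpLen < end) {
    end = tmpLen;
  } else if (chunkChecksum != null) {
    // last chunk is changing; use the in-memory checksum
    this.lastChunkChecksum = chunkChecksum;
  }
}
endOffset = end;
{code}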
I hope some folks who are more familiar with this code can comment.
Thanks.
> Incorrect offset/length calculation in pipeline recovery causes block
> corruption
> --------------------------------------------------------------------------------
>
> Key: HDFS-10587
> URL: https://issues.apache.org/jira/browse/HDFS-10587
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Wei-Chiu Chuang
> Assignee: Wei-Chiu Chuang
>
> We found that incorrect offset and length calculation in pipeline recovery may
> cause block corruption and result in missing blocks under a very unfortunate
> scenario.
> (1) A client established pipeline and started writing data to the pipeline.
> (2) One of the data nodes in the pipeline restarted, closing the socket, and
> some written data was unacknowledged.
> (3) Client replaced the failed data node with a new one, initiating block
> transfer to copy existing data in the block to the new datanode.
> (4) The block was transferred to the new node. Crucially, the entire block,
> including the unacknowledged data, was transferred.
> (5) The last chunk (512 bytes) was not a full chunk, but the destination
> still reserved the whole chunk in its buffer, and wrote the entire buffer to
> disk, therefore some written data is garbage.
> (6) When the transfer was done, the destination data node converted the
> replica from temporary to rbw, which made its visible length the length of
> bytes on disk. That is to say, it thought whatever was transferred was
> acknowledged. However, the visible length of the replica is different (rounded
> up to the next multiple of 512) from that at the source of the transfer.
> (7) The client then truncated the block in an attempt to remove unacknowledged
> data. However, because the visible length is equivalent to the bytes on disk,
> it did not truncate unacknowledged data.
> (8) When new data was appended to the destination, it skipped the bytes
> already on disk. Therefore, whatever was written as garbage was not replaced.
> (9) The volume scanner detected the corrupt replica, but due to HDFS-10512, it
> wouldn't tell the NameNode to mark the replica as corrupt, so the client
> continued to form a pipeline using the corrupt replica.
> (10) Finally the DN that had the only healthy replica was restarted. The
> NameNode then updated the pipeline to contain only the corrupt replica.
> (11) The client continued to write to the corrupt replica, because neither the
> client nor the data node itself knew the replica was corrupt. When the
> restarted datanodes came back, their replicas were stale, even though they were
> not corrupt. Therefore, none of the replicas was both good and up to date.
> The sequence of events was reconstructed based on DataNode/NameNode logs and
> my understanding of the code.
> Incidentally, we have observed the same sequence of events on two independent
> clusters.