[ https://issues.apache.org/jira/browse/HDFS-10587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15369328#comment-15369328 ]

Yongjun Zhang commented on HDFS-10587:
--------------------------------------

I made a quick change

{code}
diff --git hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockSender.java hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockSender.java
index 398935d..188768b 100644
--- hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockSender.java
+++ hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockSender.java
@@ -363,6 +363,7 @@
       
       // Ensure read offset is position at the beginning of chunk
       offset = startOffset - (startOffset % chunkSize);
+      /*
       if (length >= 0) {
         // Ensure endOffset points to end of chunk.
         long tmpLen = startOffset + length;
@@ -378,7 +379,8 @@
         }
       }
       endOffset = end;
-
+      */
+      endOffset = length > 0 ? startOffset + length : end;
       // seek to the right offsets
       if (offset > 0 && checksumIn != null) {
         long checksumSkip = (offset / chunkSize) * checksumSize;
{code}
and ran all the HDFS/common unit tests; they all passed.
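
To make the effect concrete, here is a minimal sketch of the old 
chunk-rounding calculation (paraphrasing the logic the patch comments out) 
next to the new one. The variable values are assumed examples, not taken 
from a real BlockSender instance:

{code}
// Hypothetical example values, for illustration only.
long chunkSize = 512;      // bytes per checksum chunk
long startOffset = 0;      // read from the beginning of the block
long length = 1000;        // visible length of the replica
long end = 1536;           // bytes on disk (three full chunks)

// Old behavior: round the end of the read up to a chunk boundary.
long tmpLen = startOffset + length;
if (tmpLen % chunkSize != 0) {
  tmpLen += (chunkSize - tmpLen % chunkSize);
}
long endOffsetOld = Math.min(tmpLen, end);  // 1024: 24 bytes past the visible length

// Quick change above: send exactly length bytes.
long endOffsetNew = length > 0 ? startOffset + length : end;  // 1000
{code}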

Either we don't have a test that enforces {{// Ensure endOffset points to end 
of chunk.}}, or it's OK not to have this enforcement.

If we don't need the enforcement, then the solution I would propose is to send 
{{length}} worth of data (where {{length}} is the visibleLength in this 
context) in BlockSender, as the quick change above illustrates.
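
As a hypothetical worked example of the 512-byte rounding involved (the 
numbers are made up, not from a real cluster):

{code}
// Hypothetical example values, for illustration only.
long chunkSize = 512;
long visibleLength = 1000;  // acknowledged/visible bytes at the transfer source
// Transferring whole chunks rounds the on-disk length up to a chunk boundary:
long onDisk = ((visibleLength + chunkSize - 1) / chunkSize) * chunkSize;  // 1024
long garbageTail = onDisk - visibleLength;  // 24 bytes of buffer padding on disk
// The destination then treats all 1024 bytes as acknowledged, so the client's
// truncation removes nothing and a later append skips over the garbage bytes.
{code}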

So I'd suggest that we look further into whether we really need the 
above-mentioned enforcement.

Thanks.


> Incorrect offset/length calculation in pipeline recovery causes block 
> corruption
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-10587
>                 URL: https://issues.apache.org/jira/browse/HDFS-10587
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>
> We found that an incorrect offset and length calculation in pipeline 
> recovery may cause block corruption and result in missing blocks under a 
> very unfortunate scenario. 
> (1) A client established a pipeline and started writing data to the pipeline.
> (2) One of the data nodes in the pipeline restarted, closing the socket and 
> leaving some written data unacknowledged.
> (3) The client replaced the failed data node with a new one, initiating a 
> block transfer to copy the existing data in the block to the new datanode.
> (4) The block was transferred to the new node. Crucially, the entire block, 
> including the unacknowledged data, was transferred.
> (5) The last chunk (512 bytes) was not a full chunk, but the destination 
> still reserved the whole chunk in its buffer and wrote the entire buffer to 
> disk; therefore some of the written data was garbage.
> (6) When the transfer was done, the destination data node converted the 
> replica from temporary to rbw, which made its visible length the length of 
> the bytes on disk. That is to say, it thought whatever was transferred was 
> acknowledged. However, the visible length of the replica differed (rounded 
> up to the next multiple of 512) from that of the transfer source.
> (7) The client then truncated the block in an attempt to remove the 
> unacknowledged data. However, because the visible length was equal to the 
> length of the bytes on disk, it did not truncate the unacknowledged data.
> (8) When new data was appended to the destination, it skipped the bytes 
> already on disk. Therefore, whatever was written as garbage was not replaced.
> (9) The volume scanner detected the corrupt replica, but due to HDFS-10512, 
> it wouldn't tell the NameNode to mark the replica as corrupt, so the client 
> continued to form a pipeline using the corrupt replica.
> (10) Finally, the DN that had the only healthy replica was restarted. The 
> NameNode then updated the pipeline to contain only the corrupt replica.
> (11) The client continued to write to the corrupt replica, because neither 
> the client nor the data node itself knew the replica was corrupt. When the 
> restarted datanodes came back, their replicas were stale, despite not being 
> corrupt. Therefore, none of the replicas was good and up to date.
> The sequence of events was reconstructed based on the DataNode/NameNode logs 
> and my understanding of the code.
> Incidentally, we have observed the same sequence of events on two independent 
> clusters.


