[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes

ASF GitHub Bot (Jira) Tue, 18 Jul 2023 03:31:12 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744159#comment-17744159
 ]


ASF GitHub Bot commented on HDFS-17094:
---------------------------------------

zhangshuyan0 opened a new pull request, #5854:
URL: https://github.com/apache/hadoop/pull/5854

   ### Description of PR
   When a block recovery occurs, `RecoveryTaskStriped` in the datanode expects 
`rBlock.getLocations()` and `rBlock.getBlockIndices()` to be in one-to-one 
correspondence. However, if some locations are stale when the NameNode 
handles the heartbeat, this correspondence is broken: `recoveryLocations` 
excludes the stale locations, but the block indices array is still 
complete (i.e. it contains the indices of all the locations). 
   
https://github.com/apache/hadoop/blob/c44823dadb73a3033f515329f70b2e3126fcb7be/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1720-L1724
   
https://github.com/apache/hadoop/blob/c44823dadb73a3033f515329f70b2e3126fcb7be/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1754-L1757
   This will cause `BlockRecoveryWorker.RecoveryTaskStriped#recover()` to 
generate a wrong internal block ID, and the corresponding datanode cannot find 
the replica, thus making the recovery process fail. 
   
https://github.com/apache/hadoop/blob/c44823dadb73a3033f515329f70b2e3126fcb7be/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockRecoveryWorker.java#L407-L416
   This bug needs to be fixed.
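
   The mismatch described above can be illustrated with a small standalone sketch. It is not the Hadoop code: the class `StaleLocationSketch`, its helper methods, the datanode names, and the block group ID are all hypothetical, and the rule "internal block ID = block group ID + block index" is stated here as an assumption about how striped internal blocks are numbered. The sketch only shows why reading a filtered locations list against an unfiltered indices array pairs a datanode with the wrong internal block.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; none of these names exist in Hadoop. */
final class StaleLocationSketch {

  // Assumption: internal block ID = block group ID + index within the group.
  static long internalBlockId(long blockGroupId, byte blockIndex) {
    return blockGroupId + blockIndex;
  }

  // Drops stale locations, mirroring how recoveryLocations omits stale nodes.
  static List<String> liveLocations(String[] locations, boolean[] stale) {
    List<String> live = new ArrayList<>();
    for (int i = 0; i < locations.length; i++) {
      if (!stale[i]) {
        live.add(locations[i]);
      }
    }
    return live;
  }

  // Drops the indices of stale locations so both arrays stay aligned.
  static byte[] liveIndices(byte[] indices, boolean[] stale) {
    int count = 0;
    for (boolean s : stale) {
      if (!s) {
        count++;
      }
    }
    byte[] live = new byte[count];
    int j = 0;
    for (int i = 0; i < indices.length; i++) {
      if (!stale[i]) {
        live[j++] = indices[i];
      }
    }
    return live;
  }

  public static void main(String[] args) {
    long blockGroupId = -1024L;                 // hypothetical block group ID
    String[] locations = {"dn0", "dn1", "dn2"}; // hypothetical datanodes
    byte[] indices = {0, 1, 2};                 // complete indices array
    boolean[] stale = {false, true, false};     // dn1 is stale

    List<String> live = liveLocations(locations, stale);
    byte[] aligned = liveIndices(indices, stale);

    for (int i = 0; i < live.size(); i++) {
      // Reading the unfiltered array by position pairs dn2 with index 1.
      System.out.println(live.get(i)
          + " unfiltered=" + internalBlockId(blockGroupId, indices[i])
          + " aligned=" + internalBlockId(blockGroupId, aligned[i]));
    }
  }
}
```

   In this sketch `dn2` holds internal block `-1022`, but reading the unfiltered indices by position yields `-1023`, which `dn2` does not have. Filtering both arrays together is one way to keep them aligned; it is shown only as an illustration and not as a description of the actual patch.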
   
   ### How was this patch tested?
   Added a new unit test.
   
   




> EC: Fix bug in block recovery when there are stale datanodes
> ------------------------------------------------------------
>
>                 Key: HDFS-17094
>                 URL: https://issues.apache.org/jira/browse/HDFS-17094
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Shuyan Zhang
>            Priority: Major
>
> When a block recovery occurs, `RecoveryTaskStriped` in the datanode expects 
> `rBlock.getLocations()` and `rBlock.getBlockIndices()` to be in one-to-one 
> correspondence. However, if some locations are stale when the NameNode 
> handles the heartbeat, this correspondence is broken: `recoveryLocations` 
> excludes the stale locations, but the block indices array is still 
> complete (i.e. it contains the indices of all the locations). This causes 
> `BlockRecoveryWorker.RecoveryTaskStriped#recover` to generate a wrong 
> internal block ID, so the corresponding datanode cannot find the replica 
> and the recovery process fails. This bug needs to be fixed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
