[ 
https://issues.apache.org/jira/browse/HDFS-16874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan updated HDFS-16874:
------------------------------
    Target Version/s: 3.5.0  (was: 3.4.1)

> Improve DataNode decommission for Erasure Coding
> ------------------------------------------------
>
>                 Key: HDFS-16874
>                 URL: https://issues.apache.org/jira/browse/HDFS-16874
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: ec, erasure-coding
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>            Priority: Major
>
> There are a couple of issues with the current DataNode decommission 
> implementation when large amounts of Erasure Coding data are involved in the 
> data re-replication/reconstruction process:
>  # Slowness. In HDFS-8786 we decided to use re-replication for DataNode 
> decommission if the internal EC block is still available. While this 
> strategy avoids the CPU cost of EC reconstruction, it severely limits the 
> overall data recovery bandwidth, since a single DataNode serves as the only 
> source. As high-density HDD hosts become more widely used in HDFS, 
> especially together with Erasure Coding for warm data, this becomes a major 
> pain point for cluster management. In our production clusters, 
> decommissioning a DataNode storing several hundred TB of EC data can take 
> several days (see the back-of-the-envelope sketch after this list). 
> HDFS-16613 optimizes within the existing mechanism, but more fundamentally 
> we may want to allow EC reconstruction during DataNode decommission so as 
> to achieve much larger recovery bandwidth.
>  # The semantics of the existing EC reconstruction command (the 
> BlockECReconstructionInfoProto msg sent from the NameNode to the DataNode) 
> are unclear. The existing command relies on the holes in the 
> srcNodes/liveBlockIndices arrays to indicate the target internal blocks for 
> recovery, but a hole can also mean that the corresponding DataNode is too 
> busy to be used as a reconstruction source. As a result, the later 
> DataNode-side reconstruction may not be consistent with the original 
> intention. E.g., if the index of the missing block is 6 and the DataNode 
> storing block 0 is busy, the src nodes in the reconstruction command only 
> cover blocks [1, 2, 3, 4, 5, 7, 8], and the target DataNode may reconstruct 
> internal block 0 instead of 6 (see the sketch after this list). HDFS-16566 
> works on this issue by indicating an excluded-index list. More 
> fundamentally, we can follow the same path but go a step further by adding 
> an optional field to the command protobuf msg that explicitly indicates the 
> target block indices. With this extension the DataNode no longer needs to 
> use the holes in the src node array to "guess" the reconstruction targets.
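> For intuition on the bandwidth limit in point 1, here is a 
> back-of-the-envelope sketch; the throughput and parallelism numbers are 
> illustrative assumptions, not measurements from this ticket:
> {code:java}
> // Back-of-the-envelope estimate (illustrative numbers only): re-replication
> // streams every byte out of the single decommissioning node, while EC
> // reconstruction fans the read load out across many source nodes.
> public class DecommissionEstimate {
>   public static void main(String[] args) {
>     double dataBytes = 300e12;          // ~300 TB of EC data on the node
>     double perNodeBytesPerSec = 200e6;  // ~200 MB/s sustainable per node
> 
>     // Re-replication: the decommissioning node is the only source.
>     double reReplicationDays = dataBytes / perNodeBytesPerSec / 86400;
> 
>     // EC reconstruction: reads are spread across many source/target pairs,
>     // removing the single-node bottleneck (assume 100-way parallelism).
>     double reconstructionDays = reReplicationDays / 100;
> 
>     System.out.printf("re-replication: %.1f days%n", reReplicationDays);   // ~17.4
>     System.out.printf("reconstruction: %.2f days%n", reconstructionDays);  // ~0.17
>   }
> }
> {code}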
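> To make the ambiguity in point 2 concrete, here is a minimal sketch; it is 
> not the actual DataNode code, and the targetIndices field stands in for the 
> proposed optional protobuf field, but it shows why hole-filling cannot 
> distinguish a missing block from a busy source:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> 
> // For an RS(6,3) stripe the internal block indices are 0..8. Block 6 is
> // missing, and the node holding block 0 is merely too busy to be a source,
> // so the command carries liveBlockIndices = [1, 2, 3, 4, 5, 7, 8].
> public class TargetGuessing {
>   public static void main(String[] args) {
>     int totalBlocks = 9;  // 6 data + 3 parity
>     int[] liveBlockIndices = {1, 2, 3, 4, 5, 7, 8};
> 
>     boolean[] present = new boolean[totalBlocks];
>     for (int i : liveBlockIndices) {
>       present[i] = true;
>     }
> 
>     // "Fill the holes" logic: every absent index looks like a target,
>     // so a truly missing block (6) and a busy source (0) are conflated.
>     List<Integer> guessedTargets = new ArrayList<>();
>     for (int i = 0; i < totalBlocks; i++) {
>       if (!present[i]) {
>         guessedTargets.add(i);
>       }
>     }
>     System.out.println(guessedTargets);  // [0, 6] -- the DN may pick 0
> 
>     // With an explicit target list in the command (the optional field
>     // proposed above), no guessing is needed.
>     int[] targetIndices = {6};
>     System.out.println(targetIndices[0]);  // 6 -- unambiguous
>   }
> }
> {code}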
> Internally we have developed and applied fixes following the above 
> directions, and have seen a significant improvement (a 100+ times speedup) 
> in DataNode decommission speed for EC data. The clearer semantics of the 
> reconstruction command protobuf msg also help prevent potential data 
> corruption during EC reconstruction.
> We will use this ticket to track similar fixes for the Apache releases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
