[
https://issues.apache.org/jira/browse/HDFS-16874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shilun Fan updated HDFS-16874:
------------------------------
Target Version/s: 3.5.0 (was: 3.4.1)
> Improve DataNode decommission for Erasure Coding
> ------------------------------------------------
>
> Key: HDFS-16874
> URL: https://issues.apache.org/jira/browse/HDFS-16874
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: ec, erasure-coding
> Reporter: Jing Zhao
> Assignee: Jing Zhao
> Priority: Major
>
> There are a couple of issues with the current DataNode decommission
> implementation when large amounts of Erasure Coding data are involved in the
> data re-replication/reconstruction process:
> # Slowness. In HDFS-8786 we made a decision to use re-replication for
> DataNode decommission if the internal EC block is still available. While this
> strategy reduces the CPU cost caused by EC reconstruction, it greatly limits
> the overall data recovery bandwidth, since there is only a single DataNode
> serving as the source. As high-density HDD hosts become more and more widely
> used by HDFS, especially along with Erasure Coding for warm-data use cases,
> this becomes a big pain point for cluster management. In our production
> environment, decommissioning a DataNode storing several hundred TB of EC data
> can take several days. HDFS-16613 provides an optimization based on the
> existing mechanism, but more fundamentally we may want to allow EC
> reconstruction for DataNode decommission so as to achieve much larger
> recovery bandwidth.
> # The semantics of the existing EC reconstruction command (the
> BlockECReconstructionInfoProto msg sent from the NN to the DN) are not clear.
> The existing reconstruction command depends on the holes in the
> srcNodes/liveBlockIndices arrays to indicate the target internal blocks for
> recovery, but a hole can also be caused by the corresponding datanode being
> too busy to serve as a reconstruction source. As a result, the later
> DataNode-side reconstruction may not be consistent with the original
> intention. E.g., if the index of the missing block is 6, and the datanode
> storing block 0 is busy, the src nodes in the reconstruction command only
> cover blocks [1, 2, 3, 4, 5, 7, 8]. The target datanode may then reconstruct
> internal block 0 instead of 6. HDFS-16566 addresses this issue by adding an
> excluded-index list. More fundamentally, we can follow the same path but go a
> step further by adding an optional field that explicitly indicates the target
> block indices in the command protobuf msg. With this extension the DataNode
> will no longer use the holes in the src node array to "guess" the
> reconstruction targets.
> Internally we have developed and applied fixes following the above
> directions, and have seen significant improvement (100+ times speedup) in
> datanode decommission speed for EC data. The clearer semantics of the
> reconstruction command protobuf msg also help prevent potential data
> corruption during EC reconstruction.
> We will use this ticket to track similar fixes for the Apache releases.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)