smengcl commented on code in PR #10086: URL: https://github.com/apache/ozone/pull/10086#discussion_r3314550545
########## hadoop-hdds/docs/content/design/ec-decommission-reconstruct.md: ########## @@ -0,0 +1,191 @@ +--- +title: Speed up EC container decommission +summary: Transition from single-source replication to multi-source reconstruction for EC container decommission. +date: 2026-04-17 +jira: HDDS-15014 +status: draft +--- +<!-- + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. See accompanying LICENSE file. +--> + +[HDDS-15014](https://issues.apache.org/jira/browse/HDDS-15014) Speed up EC container decommission + +# Problem: + +Decommissioning of dense datanode is, especially the datanodes of mostly EC containers, is very slow: + +[https://ozone.apache.org/docs/next/administrator-guide/operations/node-decommissioning-and-maintenance/datanodes/datanode-decommission](https://ozone.apache.org/docs/next/administrator-guide/operations/node-decommissioning-and-maintenance/datanodes/datanode-decommission) + +``` +When we initiate the process of decommissioning, first we check the current state of the node, ideally it should be IN_SERVICE, then we change it's state to DECOMMISSIONING and start the process of decommissioning, it goes through a workflow where the following happens: + +1. First an event is fired to close any pipelines on the node, which will also close any containers. +2. Next the containers on the node are obtained and checked to see if new replicas are needed. If so, the new replicas are scheduled. +3. After scheduling replication, the node remains pending until replication has completed. +4. At this stage the node will complete the decommission process and the state of the node will be changed to DECOMMISSIONED. +``` + +EC container decommissioning is bottlenecked by the transfer speed of a single source datanode. + +[https://ozone.apache.org/docs/next/system-internals/replication/data/containers/replication#scenario-1-decommissioning](https://ozone.apache.org/docs/next/system-internals/replication/data/containers/replication#scenario-1-decommissioning) + +``` +**How it works:** + +1. **Detection**: ECUnderReplicationHandler detects containers with replicas on decommissioning Datanodes +2. **Index Identification**: The handler identifies which EC indexes are only present on decommissioning Datanodes (decommissioningOnlyIndexes()) +3. **One-to-One Replication**: For each decommissioning index, a replication command is created to copy that specific index to a new Datanode +4. **Target Selection**: New target Datanodes are selected based on placement policies +5. **Replication Execution**: Each index is replicated independently using the configured replication mode (push by default, configurable to pull via hdds.scm.replication.push) +``` + +[https://ozone.apache.org/docs/system-internals/replication/data/replication-manager/#ec-and-decommissioning](https://ozone.apache.org/docs/system-internals/replication/data/replication-manager/#ec-and-decommissioning) + +``` +For an EC container, the decommissioning host is likely the only source of the replica which needs to be copied and hence the decommission will be slower. +``` + +If the majority of the containers on the decommissioning datanode are EC containers, the network bandwidth or the aggregate disk speed of the datanode determines how long the decommission will require. + +# Node-level congestion + +Let’s say a cluster’s SLA dictates the decommission must complete within 8 hours. + +* Decommission of a 100TB datanode full of EC containers: + * Implying 3.45GB/s replication/reconstruction rate. + * 100TB datanode implies 12 x 8TB disks. + * Each disk has 150MB/s maximum throughput so 1.8GB/s throughput per datanode, and at least 25Gbps network bandwidth. + * Therefore, EC decommission cannot just replicate from source. It needs at least 2 target nodes too. + +* For 400TB datanodes: + * Implying 13.8GB/s replication/reconstruction rate. + * 400TB datanode implies 24 x 16TB disks. + * Each disk has 150MB/s maximum throughput so 3.6GB/s throughput per datanode, and at least 40Gbps network bandwidth. + * Therefore, EC decommission cannot just replicate from source. It needs at least 4 target nodes too. + * Assuming RS(3,2), to reconstruct a container requires reading 3x. Implying 13.8*3 = 41.4 GB/s bisectional bandwidth, which is at least 12 datanodes aggregated disk throughput. + * Assuming RS(6,3), to reconstruct a container requires reading 6x. Implying 13.8*6 = 82.8 GB/s bisectional bandwidth, which is at least 23 datanodes aggregated disk throughput. + +# Disk-level congestion + +In fact, because Ozone replication manager does not control I/O at disk level (only node level), multiple tasks may land at the same disk at the same time, and the single disk becomes the bottleneck for the entire datanode, and thus the enter cluster. + +* Enqueue more tasks at the same DN; change task pick up algorithm so Datanode create a X-way queue, + +Reconstruction is CPU intensive, network intensive and disk I/O intensive. It is therefore less efficient than re-replication, but reconstruction in a larger cluster improves parallelism and throughput. + +# Solutions + +## Solution 1 + +Simply force EC decommission to perform reconstruction, similar to under-constructed EC containers. + +## Solution 2 + +Start EC decommission with 1-1 replication. Once the number of in-flight decommission exceeds a pre-configured threshold (`hdds.scm.replication.decommission.ec.reconstruction.threshold`), Replication Manager schedules reconstruction tasks instead. + +This behavior can be enabled/disabled with a feature flag `hdds.scm.replication.decommission.ec.reconstruction.enabled` + +We will implement all three, using solution 1 as a baseline. Solution 2 and 3 provide more sophisticated balancing between overhead and parallelism. + +## Solution 3 + +This implementation plan outlines the transition from single-source replication to multi-source reconstruction for EC container decommission, as described in HDDS-15014. The plan focuses on SCM-side dynamic switching and Datanode-side disk-level fairness. + +### Phase 1: SCM Configuration and Global Capacity +We will introduce new configuration properties in `ReplicationManagerConfiguration` to control the behavior and protect cluster resources. + +1. **New Configuration Keys:** + * `hdds.scm.replication.decommission.ec.reconstruction.enabled` (Boolean, default: false): Feature flag to enable/disable the switch to reconstruction during decommission. Review Comment: Which DN will act as the client when performing EC reconstruction? The source DN that has one of the stripes? Or the target DN that will receive the reconstructed block -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
