Re: [PR] HDDS-15014. Design doc for Speed up EC container decommission [ozone]

via GitHub Wed, 27 May 2026 17:00:49 -0700


smengcl commented on code in PR #10086:
URL: https://github.com/apache/ozone/pull/10086#discussion_r3314550545



##########
hadoop-hdds/docs/content/design/ec-decommission-reconstruct.md:
##########
@@ -0,0 +1,191 @@
+---
+title: Speed up EC container decommission
+summary: Transition from single-source replication to multi-source 
reconstruction for EC container decommission.
+date: 2026-04-17
+jira: HDDS-15014
+status: draft
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+[HDDS-15014](https://issues.apache.org/jira/browse/HDDS-15014) Speed up EC 
container decommission
+
+# Problem:
+
+Decommissioning of dense datanode is, especially the datanodes of mostly EC 
containers, is very slow:
+
+[https://ozone.apache.org/docs/next/administrator-guide/operations/node-decommissioning-and-maintenance/datanodes/datanode-decommission](https://ozone.apache.org/docs/next/administrator-guide/operations/node-decommissioning-and-maintenance/datanodes/datanode-decommission)
+
+```
+When we initiate the process of decommissioning, first we check the current 
state of the node, ideally it should be IN_SERVICE, then we change it's state 
to DECOMMISSIONING and start the process of decommissioning, it goes through a 
workflow where the following happens:
+
+1. First an event is fired to close any pipelines on the node, which will also 
close any containers.  
+2. Next the containers on the node are obtained and checked to see if new 
replicas are needed. If so, the new replicas are scheduled.  
+3. After scheduling replication, the node remains pending until replication 
has completed.  
+4. At this stage the node will complete the decommission process and the state 
of the node will be changed to DECOMMISSIONED.
+```
+
+EC container decommissioning is bottlenecked by the transfer speed of a single 
source datanode.
+
+[https://ozone.apache.org/docs/next/system-internals/replication/data/containers/replication#scenario-1-decommissioning](https://ozone.apache.org/docs/next/system-internals/replication/data/containers/replication#scenario-1-decommissioning)
+
+```
+**How it works:**
+
+1. **Detection**: ECUnderReplicationHandler detects containers with replicas 
on decommissioning Datanodes  
+2. **Index Identification**: The handler identifies which EC indexes are only 
present on decommissioning Datanodes (decommissioningOnlyIndexes())  
+3. **One-to-One Replication**: For each decommissioning index, a replication 
command is created to copy that specific index to a new Datanode  
+4. **Target Selection**: New target Datanodes are selected based on placement 
policies  
+5. **Replication Execution**: Each index is replicated independently using the 
configured replication mode (push by default, configurable to pull via 
hdds.scm.replication.push)
+```
+
+[https://ozone.apache.org/docs/system-internals/replication/data/replication-manager/#ec-and-decommissioning](https://ozone.apache.org/docs/system-internals/replication/data/replication-manager/#ec-and-decommissioning)
  
+
+```
+For an EC container, the decommissioning host is likely the only source of the 
replica which needs to be copied and hence the decommission will be slower.
+```
+
+If the majority of the containers on the decommissioning datanode are EC 
containers, the network bandwidth or the aggregate disk speed of the datanode 
determines how long the decommission will require.
+
+# Node-level congestion
+
+Let’s say a cluster’s SLA dictates the decommission must complete within 8 
hours. 
+
+* Decommission of a 100TB datanode full of EC containers:  
+  * Implying 3.45GB/s replication/reconstruction rate.  
+  * 100TB datanode implies 12 x 8TB disks.  
+  * Each disk has 150MB/s maximum throughput so 1.8GB/s throughput per 
datanode, and at least 25Gbps network bandwidth.  
+  * Therefore, EC decommission cannot just replicate from source. It needs at 
least 2 target nodes too.
+
+* For 400TB datanodes:  
+  * Implying 13.8GB/s replication/reconstruction rate.  
+  * 400TB datanode implies 24 x 16TB disks.  
+  * Each disk has 150MB/s maximum throughput so 3.6GB/s throughput per 
datanode, and at least 40Gbps network bandwidth.  
+  * Therefore, EC decommission cannot just replicate from source. It needs at 
least 4 target nodes too.  
+  * Assuming RS(3,2), to reconstruct a container requires reading 3x. Implying 
13.8*3 = 41.4 GB/s bisectional bandwidth, which is at least 12 datanodes 
aggregated disk throughput.  
+  * Assuming RS(6,3), to reconstruct a container requires reading 6x. Implying 
13.8*6 = 82.8 GB/s bisectional bandwidth, which is at least 23 datanodes 
aggregated disk throughput.
+
+# Disk-level congestion
+
+In fact, because Ozone replication manager does not control I/O at disk level 
(only node level), multiple tasks may land at the same disk at the same time, 
and the single disk becomes the bottleneck for the entire datanode, and thus 
the enter cluster.
+
+* Enqueue more tasks at the same DN; change task pick up algorithm so Datanode 
create a X-way queue,  
+
+Reconstruction is CPU intensive, network intensive and disk I/O intensive. It 
is therefore less efficient than re-replication, but reconstruction in a larger 
cluster improves parallelism and throughput.
+
+# Solutions
+
+## Solution 1
+
+Simply force EC decommission to perform reconstruction, similar to 
under-constructed EC containers.
+
+## Solution 2
+
+Start EC decommission with 1-1 replication. Once the number of in-flight 
decommission exceeds a pre-configured threshold 
(`hdds.scm.replication.decommission.ec.reconstruction.threshold`), Replication 
Manager schedules reconstruction tasks instead.
+
+This behavior can be enabled/disabled with a feature flag 
`hdds.scm.replication.decommission.ec.reconstruction.enabled`
+
+We will implement all three, using solution 1 as a baseline. Solution 2 and 3 
provide more sophisticated balancing between overhead and parallelism.
+
+## Solution 3
+
+This implementation plan outlines the transition from single-source 
replication to multi-source reconstruction for EC container decommission, as 
described in HDDS-15014. The plan focuses on SCM-side dynamic switching and 
Datanode-side disk-level fairness.
+
+### Phase 1: SCM Configuration and Global Capacity
+We will introduce new configuration properties in 
`ReplicationManagerConfiguration` to control the behavior and protect cluster 
resources.
+
+1.  **New Configuration Keys:**
+    *   `hdds.scm.replication.decommission.ec.reconstruction.enabled` 
(Boolean, default: false): Feature flag to enable/disable the switch to 
reconstruction during decommission.

Review Comment:
   Which DN will act as the client when performing EC reconstruction?
   
   The source DN that has one of the stripes? Or the target DN that will 
receive the reconstructed block



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HDDS-15014. Design doc for Speed up EC container decommission [ozone]

Reply via email to