Gargi-jais11 commented on code in PR #8837:
URL: https://github.com/apache/ozone/pull/8837#discussion_r2220687045


##########
hadoop-hdds/docs/content/design/diskbalancer.md:
##########
@@ -0,0 +1,136 @@
+---
+title: "DiskBalancer for Datanode"
+summary: "DiskBalancer is a feature to evenly distribute data across all disks within a Datanode for even disk utilisation."
+date: 2025-07-21
+jira: HDDS-5713
+status: implementing
+author: Janus Chow, Sammi Chen, Gargi Jaiswal, Stephen O'Donnell
+---
+<!--
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+   http://www.apache.org/licenses/LICENSE-2.0
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+# [HDDS-5713](https://issues.apache.org/jira/browse/HDDS-5713) DiskBalancer for Datanode (implementing)
+
+## Background
+**Apache Ozone** distributes containers evenly across all disks
+on each Datanode when data is first written. This initial spread
+ensures that I/O load is balanced from the start. However,
+over the operational lifetime of a cluster, **disk imbalance** can
+occur for the following reasons:
+- **Adding new disks** to expand datanode storage space.
+- **Replacing old broken disks** with new disks.
+- Massive **block** or **replica deletions**.
+
+This uneven utilisation of disks can create performance bottlenecks, as
+**over-utilised disks** become **hotspots** that limit the overall throughput of the
+Datanode. The new **DiskBalancer** feature is therefore introduced to
+ensure even data distribution across disks within a Datanode.
+
+## Proposed Solution
+The DiskBalancer is a feature which evenly distributes data across
+different disks of a Datanode.
+
+It detects an imbalance within a Datanode using a metric borrowed from
+[HDFS DiskBalancer](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html)
+called **Volume Data Density**. This metric is calculated for
+each disk using the following formula:
+
+```
+AverageUtilization = TotalUsed / TotalCapacity
+VolumeUtilization = diskUsedSpace / diskCapacity
+VolumeDataDensity = | VolumeUtilization - AverageUtilization |
+```
+Here, **VolumeUtilization** is each disk's individual utilisation and
+**AverageUtilization** is the ideal utilisation for all disks to maintain
+evenness.
+
+A disk is considered a candidate for balancing if its `VolumeDataDensity` exceeds a configurable
+`threshold`. The DiskBalancer then moves containers from the most
+utilised disk to the least utilised disk. The DiskBalancer can be triggered manually by **CLI commands**.
+
+## High-Level DiskBalancer Implementation
+
+The general view of this design consists of 3 parts as follows:
+
+**Client & SCM -:**
+
+Administrators use the `ozone admin datanode diskbalancer` CLI to manage and monitor the feature.
+* Clients control the DiskBalancer job by sending requests to SCM, such as start,
+stop, and update configuration, and can query the DiskBalancer status.
+* Clients get the storageReport from SCM to decide which Datanode to balance.
+
+**SCM & DN -:**
+
+SCM acts as a **control plane** and information hub but remains **stateless**
+regarding the balancing process.
+* SCM retrieves the storageReport and balancing status from the DN via heartbeat.
+
+**DN -:**
+
+All balancing operations are performed on the Datanodes.
+
+A daemon thread, the **Scheduler**, runs periodically on each Datanode.
+1. It calculates the `VolumeDataDensity` for all volumes.
+2. If an imbalance is detected (i.e., density > threshold), it selects a set of closed containers
+to move from the most over-utilised disk (source) to the least utilised disk (destination).
+3. The scheduler dispatches these move tasks to a pool of **Worker** threads for parallel execution.
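A minimal Python sketch of one scheduler pass, assuming hypothetical names throughout (`move_container` is only a stand-in for the real move task, and the threshold value is invented):

```python
# Illustrative sketch of one Scheduler pass; names and values are
# hypothetical, not the actual Ozone implementation.
from concurrent.futures import ThreadPoolExecutor

THRESHOLD = 0.1  # assumed value for the configurable threshold

def move_container(container_id, source, dest):
    """Placeholder for the real container move task."""
    return (container_id, source, dest)

def schedule_moves(disks, closed_containers, pool):
    """One pass: compute densities, then dispatch move tasks to workers."""
    total_used = sum(d["used"] for d in disks.values())
    total_capacity = sum(d["capacity"] for d in disks.values())
    avg = total_used / total_capacity
    densities = {name: abs(d["used"] / d["capacity"] - avg)
                 for name, d in disks.items()}
    if max(densities.values()) <= THRESHOLD:
        return []  # disks are balanced; nothing to do this pass
    # Source: most utilised disk. Destination: least utilised disk.
    util = {name: d["used"] / d["capacity"] for name, d in disks.items()}
    source = max(util, key=util.get)
    dest = min(util, key=util.get)
    # Dispatch one move task per closed container on the source disk.
    return [pool.submit(move_container, c, source, dest)
            for c in closed_containers.get(source, [])]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = schedule_moves(
        {"d1": {"used": 90, "capacity": 100},
         "d2": {"used": 10, "capacity": 100}},
        {"d1": ["C1", "C2"]},
        pool,
    )
    results = [f.result() for f in futures]
```

The thread pool here plays the role of the **Worker** threads: the scheduler only decides *what* to move, while the moves themselves run in parallel.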
+
+## Container Move Process
+
+Suppose we are moving container C1 **(CLOSED state)** from source disk D1 to destination disk D2:
+1. A temporary copy, `Temp C1-CLOSED`, is created in the `temp directory` of the destination disk D2.
+
+2. `Temp C1-CLOSED` is transitioned to the `Temp C1-RECOVERING` state. This **Temp C1-RECOVERING** container is then
+atomically moved to the **final destination** directory of D2 as `C1-RECOVERING`.
+3. A **new container** import is initiated for the `C1-RECOVERING` container.
+4. Once the import succeeds, all metadata updates are done for the new container created on D2.
+5. Finally, the original container `C1-CLOSED` on D1 is deleted.
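The five steps can be sketched with plain directories, where a same-filesystem rename stands in for the atomic move. Paths and helper names here are hypothetical, not the Datanode's actual on-disk layout:

```python
# Hypothetical sketch of the five-step container move, modeled with plain
# directories; the real Datanode layout and import logic differ.
import os
import shutil
import tempfile

def move_container(container_id, src_disk, dst_disk):
    src = os.path.join(src_disk, f"{container_id}-CLOSED")
    tmp = os.path.join(dst_disk, "tmp", f"{container_id}-CLOSED")
    # (1) copy into the destination disk's temp directory
    shutil.copytree(src, tmp)
    # (2) mark the temp copy RECOVERING, then atomically move it into the
    #     final destination directory (rename is atomic within one filesystem)
    recovering_tmp = os.path.join(dst_disk, "tmp", f"{container_id}-RECOVERING")
    os.rename(tmp, recovering_tmp)
    final = os.path.join(dst_disk, f"{container_id}-RECOVERING")
    os.rename(recovering_tmp, final)
    # (3) + (4) import the container and update metadata; simulated here by
    #     renaming back to CLOSED once the "import" succeeds
    os.rename(final, os.path.join(dst_disk, f"{container_id}-CLOSED"))
    # (5) delete the original container on the source disk
    shutil.rmtree(src)

# Demo on throwaway directories standing in for the two disks:
base = tempfile.mkdtemp()
d1, d2 = os.path.join(base, "d1"), os.path.join(base, "d2")
os.makedirs(os.path.join(d1, "C1-CLOSED"))
with open(os.path.join(d1, "C1-CLOSED", "chunk"), "w") as f:
    f.write("data")
move_container("C1", d1, d2)
```

Deleting the source copy only after the import and metadata updates succeed means a crash mid-move leaves at worst a stale temp or RECOVERING copy, never a lost container.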
+
+```
+D1     ----> C1-CLOSED  --- (5) ---> C1-DELETED
+        |
+        |
+       (1)
+        |
+D2      ----> Temp C1-CLOSED --- (2) ---> Temp C1-RECOVERING --- (3) ---> C1-RECOVERING --- (4) ---> C1-CLOSED
+```

Review Comment:
   The implementation of this new container move process is in this active [PR](https://github.com/apache/ozone/pull/8693). It is soon to be merged.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

