This is an automated email from the ASF dual-hosted git repository.

weichiu pushed a commit to branch HDDS-5713
in repository https://gitbox.apache.org/repos/asf/ozone.git
The following commit(s) were added to refs/heads/HDDS-5713 by this push:
     new 2334f41049 HDDS-12598. [DiskBalancer] Add design and feature document (#8837)

2334f41049 is described below

commit 2334f4104972084cf8beed22f47d744e468abf48
Author: Gargi Jaiswal <134698352+gargi-jai...@users.noreply.github.com>
AuthorDate: Wed Jul 23 12:59:57 2025 +0530

    HDDS-12598. [DiskBalancer] Add design and feature document (#8837)
---
 hadoop-hdds/docs/content/design/diskbalancer.md   | 136 +++++++++++++++++++++
 hadoop-hdds/docs/content/feature/DiskBalancer.md  | 122 ++++++++++++++++++
 .../docs/content/feature/DiskBalancer.zh.md       | 116 ++++++++++++++++++
 hadoop-hdds/docs/content/feature/diskBalancer.png | Bin 0 -> 116124 bytes
 4 files changed, 374 insertions(+)

diff --git a/hadoop-hdds/docs/content/design/diskbalancer.md b/hadoop-hdds/docs/content/design/diskbalancer.md
new file mode 100644
index 0000000000..5121631c03
--- /dev/null
+++ b/hadoop-hdds/docs/content/design/diskbalancer.md
@@ -0,0 +1,136 @@
---
title: "DiskBalancer for Datanode"
summary: "DiskBalancer is a feature to evenly distribute data across all disks within a Datanode for even disk utilization."
date: 2025-07-21
jira: HDDS-5713
status: implementing
author: Janus Chow, Sammi Chen, Gargi Jaiswal, Stephen O'Donnell
---
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
# [HDDS-5713](https://issues.apache.org/jira/browse/HDDS-5713) DiskBalancer for Datanode (implementing)

## Background
**Apache Ozone** distributes all containers evenly
across the multiple disks on each Datanode. This initial spread
ensures that I/O load is balanced from the start. However,
over the operational lifetime of a cluster, **disk imbalance** can
occur for the following reasons:
- **Adding new disks** to expand Datanode storage space.
- **Replacing old broken disks** with new disks.
- Massive **block** or **replica deletions**.

This uneven utilization of disks can create performance bottlenecks, as
**over-utilized disks** become **hotspots** that limit the overall throughput of the
Datanode. The **DiskBalancer** feature is therefore introduced to
ensure even data distribution across the disks within a Datanode.

## Proposed Solution
The DiskBalancer is a feature that evenly distributes data across
the different disks of a Datanode.

It detects imbalance within a Datanode using a metric borrowed from the
[HDFS DiskBalancer](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html),
called **Volume Data Density**. This metric is calculated for
each disk using the following formula:

```
AverageUtilization = TotalUsed / TotalCapacity
VolumeUtilization = diskUsedSpace / diskCapacity
VolumeDataDensity = | VolumeUtilization - AverageUtilization |
```
Here, **VolumeUtilization** is each disk's individual utilization and
**AverageUtilization** is the ideal utilization for all disks to maintain
evenness.

A disk is considered a candidate for balancing if its `VolumeDataDensity` exceeds a configurable
`threshold`. The DiskBalancer then moves containers from the most
utilized disk to the least utilized disk. The DiskBalancer can be triggered manually by **CLI commands**.
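The density calculation above can be sketched in a few lines. This is an illustrative Python version, not the Datanode's actual Java implementation; the `volume_densities` function name and the example volume sizes are hypothetical:

```python
def volume_densities(volumes, threshold=10.0):
    """Compute VolumeDataDensity per disk and flag balancing candidates.

    volumes maps a disk name to (used_bytes, capacity_bytes); threshold is
    a percentage, mirroring the default of
    hdds.datanode.disk.balancer.volume.density.threshold.
    """
    total_used = sum(used for used, _ in volumes.values())
    total_capacity = sum(cap for _, cap in volumes.values())
    average_utilization = total_used / total_capacity
    densities = {}
    for name, (used, cap) in volumes.items():
        volume_utilization = used / cap
        density = abs(volume_utilization - average_utilization)
        # A disk is a balancing candidate when its density, expressed as a
        # percentage, exceeds the threshold.
        densities[name] = (density, density * 100 > threshold)
    return densities


# Example: one nearly full disk and one nearly empty disk of equal size.
# AverageUtilization = 0.5, so each disk has a density of about 0.4,
# which exceeds the default 10% threshold.
print(volume_densities({"disk1": (900, 1000), "disk2": (100, 1000)}))
```

Note that the density is symmetric: both the over-utilized and the under-utilized disk exceed the threshold, which is what lets the balancer pick a source/destination pair.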

## High-Level DiskBalancer Implementation

This design consists of three parts:

**Client & SCM:**

Administrators use the `ozone admin datanode diskbalancer` CLI to manage and monitor the feature.
* Clients control the DiskBalancer job by sending requests to SCM (start,
stop, update configuration) and can query the DiskBalancer status.
* Clients get the storageReport from SCM to decide which Datanode to balance.

**SCM & DN:**

SCM acts as a **control plane** and information hub but remains **stateless**
regarding the balancing process.
* SCM retrieves the storageReport and balance status from each DN via heartbeat.

**DN:**

All balancing operations are performed on the Datanodes.

A daemon thread, the **Scheduler**, runs periodically on each Datanode.
1. It calculates the `VolumeDataDensity` for all volumes.
2. If an imbalance is detected (i.e., density > threshold), it moves a set of closed containers
from the most over-utilized disk (source) to the least utilized disk (destination).
3. The scheduler dispatches these move tasks to a pool of **Worker** threads for parallel execution.

## Container Move Process

Suppose we are moving container C1 (in **CLOSED** state) from source disk D1 to destination disk D2:
1. A temporary copy, `Temp C1-CLOSED`, is created in the `temp directory` of the destination disk D2.
2. `Temp C1-CLOSED` is transitioned to the `Temp C1-RECOVERING` state. This **Temp C1-RECOVERING** container is then
atomically moved to the **final destination** directory of D2 as `C1-RECOVERING`.
3. A **new container** import is initiated for the `C1-RECOVERING` container.
4. Once the import succeeds, all metadata updates are performed for the new container on D2.
5. Finally, the original container `C1-CLOSED` on D1 is deleted.
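The five-step sequence above can be simulated with a toy state model. The sketch below is illustrative only (a hypothetical dict-based layout standing in for the real container storage code); its point is the ordering guarantee: the original replica is deleted only after the import on the destination has fully succeeded.

```python
def move_container(cluster, container, src, dst):
    """Simulate the DiskBalancer container move protocol.

    cluster maps a disk name to {container name -> state}; a "tmp/" key
    prefix stands in for the destination disk's temp directory.
    """
    assert cluster[src][container] == "CLOSED", "only CLOSED containers move"
    tmp = "tmp/" + container
    # (1) Temporary copy in the destination's temp directory: Temp C1-CLOSED.
    cluster[dst][tmp] = "CLOSED"
    # (2) Transition the copy to Temp C1-RECOVERING, then atomically move it
    #     into the final destination directory as C1-RECOVERING.
    cluster[dst][tmp] = "RECOVERING"
    cluster[dst][container] = cluster[dst].pop(tmp)
    # (3)-(4) Import the new container and complete its metadata updates,
    #         leaving it CLOSED on the destination disk.
    cluster[dst][container] = "CLOSED"
    # (5) Only now is the original replica on the source disk deleted.
    del cluster[src][container]


cluster = {"D1": {"C1": "CLOSED"}, "D2": {}}
move_container(cluster, "C1", "D1", "D2")
print(cluster)  # → {'D1': {}, 'D2': {'C1': 'CLOSED'}}
```

Because the rename in step (2) is atomic and the source copy survives until step (5), a crash at any intermediate point leaves at least one complete replica of the container on disk.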

```
D1 ----> C1-CLOSED --- (5) ---> C1-DELETED
             |
             |
            (1)
             |
D2 ----> Temp C1-CLOSED --- (2) ---> Temp C1-RECOVERING --- (3) ---> C1-RECOVERING --- (4) ---> C1-CLOSED
```
## DiskBalancing Policies

By default, the DiskBalancer uses specific policies to decide which disks to balance and which containers to move. These
are configurable, but the default implementations provide robust and safe behavior.

* **`DefaultVolumeChoosingPolicy`**: The default policy for selecting the source and destination volumes. It
identifies the most over-utilized volume as the source and the most under-utilized volume as the destination by comparing
each volume's utilization against the Datanode's average. The calculation accounts for data that is
already in the process of being moved, so decisions are based on the future state of the volumes.

* **`DefaultContainerChoosingPolicy`**: The default policy for selecting which container to move from a source
volume. It iterates through the containers on the source disk and picks the first one that is in a **CLOSED** state
and is not already being moved by another balancing operation. To optimize performance and avoid re-scanning the same
containers repeatedly, it caches the list of containers for each volume; a cached list expires one hour after it was
last used, or when its container iterator has been fully consumed and invalidated.

## DiskBalancer Metrics

The DiskBalancer service exposes JMX metrics on each Datanode for real-time monitoring. These metrics provide insights
into the balancer's activity, progress, and overall health.

| DiskBalancer Service Metrics | Description |
|------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
| `SuccessCount` | The number of successful balance jobs. |
| `SuccessBytes` | The total number of bytes moved by successful balance jobs. |
| `FailureCount` | The number of failed balance jobs. |
| `moveSuccessTime` | The time spent on successful container moves. |
| `moveFailureTime` | The time spent on failed container moves. |
| `runningLoopCount` | The total number of times the balancer's main loop has run. |
| `idleLoopNoAvailableVolumePairCount` | The number of loops where balancing did not run because no suitable source/destination volume pair could be found. |
| `idleLoopExceedsBandwidthCount` | The number of loops where balancing did not run due to bandwidth limits. |

diff --git a/hadoop-hdds/docs/content/feature/DiskBalancer.md b/hadoop-hdds/docs/content/feature/DiskBalancer.md
new file mode 100644
index 0000000000..6ef86ec965
--- /dev/null
+++ b/hadoop-hdds/docs/content/feature/DiskBalancer.md
@@ -0,0 +1,122 @@
---
title: "DiskBalancer"
weight: 1
menu:
  main:
    parent: Features
summary: DiskBalancer for Datanodes.
---
<!---
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->

## Overview
**Apache Ozone** distributes containers evenly across the multiple disks on each Datanode.
This initial spread ensures that I/O load is balanced from the start.
However, over the operational lifetime of a
cluster, **disk imbalance** can occur for the following reasons:
- **Adding new disks** to expand Datanode storage space.
- **Replacing old broken disks** with new disks.
- Massive **block** or **replica deletions**.

This uneven utilization of disks can create performance bottlenecks, as **over-utilized disks** become **hotspots**
that limit the overall throughput of the Datanode. The **DiskBalancer** feature is therefore introduced to
ensure even data distribution across the disks within a Datanode.

A disk is considered a candidate for balancing if its
`VolumeDataDensity` exceeds a configurable `threshold`. The DiskBalancer can be triggered manually by **CLI commands**.

## Command Line Usage
The DiskBalancer is managed through the `ozone admin datanode diskbalancer` command.

### **Start DiskBalancer**
To start the DiskBalancer on all Datanodes with default configurations:

```shell
ozone admin datanode diskbalancer start -a
```

You can also start the DiskBalancer with specific options:

```shell
ozone admin datanode diskbalancer start [options]
```

### **Update Configurations**
To update DiskBalancer configurations, use the following command:

```shell
ozone admin datanode diskbalancer update [options]
```
**Options include:**

| Options | Description |
|---------------------------------------|-------------------------------------------------------------------------------------------------------|
| `-t, --threshold` | Percentage deviation from the average disk utilization after which a Datanode will be rebalanced. |
| `-b, --bandwithInMB` | Maximum bandwidth for the DiskBalancer per second. |
| `-p, --parallelThread` | Maximum number of parallel threads for the DiskBalancer. |
| `-s, --stop-after-disk-even` | Stop the DiskBalancer automatically once disk utilization is even. |
| `-a, --all` | Run commands on all Datanodes. |
| `-d, --datanodes` | Run commands on specific Datanodes. |

### **Stop DiskBalancer**
To stop the DiskBalancer on all Datanodes:

```shell
ozone admin datanode diskbalancer stop -a
```
You can also stop the DiskBalancer on specific Datanodes:

```shell
ozone admin datanode diskbalancer stop -d <datanode1>
```
### **DiskBalancer Status**
To check the status of the DiskBalancer on all Datanodes:

```shell
ozone admin datanode diskbalancer status
```
You can also check the status of the DiskBalancer on specific Datanodes:

```shell
ozone admin datanode diskbalancer status -d <datanode1>
```
### **DiskBalancer Report**
To get the **volumeDataDensity** report for the top **N** Datanodes (displayed in descending order; N defaults to 25 if not specified):

```shell
ozone admin datanode diskbalancer report --count <N>
```

## **DiskBalancer Configurations**

The DiskBalancer's behavior can be controlled using the following configuration properties in `ozone-site.xml`.

| Property | Default Value | Description |
| ------------- | ------------- | ----------- |
| `hdds.datanode.disk.balancer.volume.density.threshold` | `10.0` | A percentage (0-100). A Datanode is considered balanced if, for each volume, its utilization differs from the average Datanode utilization by no more than this threshold. |
| `hdds.datanode.disk.balancer.max.disk.throughputInMBPerSec` | `10` | The maximum bandwidth (in MB/s) that the balancer can use for moving data, to avoid impacting client I/O. |
| `hdds.datanode.disk.balancer.parallel.thread` | `5` | The number of worker threads to use for moving containers in parallel. |
| `hdds.datanode.disk.balancer.service.interval` | `60s` | The time interval at which the Datanode DiskBalancer service checks for imbalance and updates its configuration. |
| `hdds.datanode.disk.balancer.stop.after.disk.even` | `true` | If true, the DiskBalancer will automatically stop its balancing activity once disks are considered balanced (i.e., all volume densities are within the threshold). |
| `hdds.datanode.disk.balancer.volume.choosing.policy` | `org.apache.hadoop.ozone.container.diskbalancer.policy.DefaultVolumeChoosingPolicy` | The policy class for selecting source and destination volumes for balancing. |
| `hdds.datanode.disk.balancer.container.choosing.policy` | `org.apache.hadoop.ozone.container.diskbalancer.policy.DefaultContainerChoosingPolicy` | The policy class for selecting which containers to move from a source volume to a destination volume. |
| `hdds.datanode.disk.balancer.service.timeout` | `300s` | Timeout for Datanode DiskBalancer service operations. |
| `hdds.datanode.disk.balancer.should.run.default` | `false` | If the balancer fails to read its persisted configuration, this value determines whether the service should run by default. |

diff --git a/hadoop-hdds/docs/content/feature/DiskBalancer.zh.md b/hadoop-hdds/docs/content/feature/DiskBalancer.zh.md
new file mode 100644
index 0000000000..a44fb07cd7
--- /dev/null
+++ b/hadoop-hdds/docs/content/feature/DiskBalancer.zh.md
@@ -0,0 +1,116 @@
---
title: "DiskBalancer"
weight: 1
menu:
  main:
    parent: Features
summary: DiskBalancer for Datanodes.
---
<!---
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.
You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->

## Overview
**Apache Ozone** distributes containers evenly across the multiple disks on each Datanode. This initial spread ensures
that I/O load is balanced from the start. However, over the operational lifetime of a cluster, **disk imbalance** can
occur for the following reasons:
- **Adding new disks** to expand Datanode storage space.
- **Replacing old broken disks** with new disks.
- Massive **block** or **replica deletions**.

This uneven utilization of disks can create performance bottlenecks, as **over-utilized disks** become **hotspots**
that limit the overall throughput of the Datanode. The **DiskBalancer** feature is therefore introduced to ensure
even data distribution across the disks within a Datanode.

A disk is considered a candidate for balancing if its `VolumeDataDensity` exceeds a configurable `threshold`.
The DiskBalancer can be triggered manually by **CLI commands**.

## Command Line Usage
The DiskBalancer is managed through the `ozone admin datanode diskbalancer` command.

### **Start DiskBalancer**
To start the DiskBalancer on all Datanodes with default configurations:

```shell
ozone admin datanode diskbalancer start -a
```

You can also start the DiskBalancer with specific options:
```shell
ozone admin datanode diskbalancer start [options]
```

### **Update Configurations**
To update DiskBalancer configurations, use the following command:

```shell
ozone admin datanode diskbalancer update [options]
```
**Options include:**

| Options | Description |
|------------------------------|-------------------------------------------------------------------------------------------------------|
| `-t, --threshold` | Percentage deviation from the average disk utilization after which a Datanode will be rebalanced. |
| `-b, --bandwithInMB` | Maximum bandwidth for the DiskBalancer per second. |
| `-p, --parallelThread` | Maximum number of parallel threads for the DiskBalancer. |
| `-s, --stop-after-disk-even` | Stop the DiskBalancer automatically once disk utilization is even. |
| `-a, --all` | Run commands on all Datanodes. |
| `-d, --datanodes` | Run commands on specific Datanodes. |

### **Stop DiskBalancer**
To stop the DiskBalancer on all Datanodes:

```shell
ozone admin datanode diskbalancer stop -a
```
You can also stop the DiskBalancer on specific Datanodes:

```shell
ozone admin datanode diskbalancer stop -d <datanode1>
```
### **DiskBalancer Status**
To check the status of the DiskBalancer on all Datanodes:

```shell
ozone admin datanode diskbalancer status
```
You can also check the status of the DiskBalancer on specific Datanodes:
```shell
ozone admin datanode diskbalancer status -d <datanode1>
```
### **DiskBalancer Report**
To get the **volumeDataDensity** report for the top **N** Datanodes (displayed in descending order; N defaults to 25 if not specified):

```shell
ozone admin datanode diskbalancer report --count <N>
```

## DiskBalancer Configurations

The DiskBalancer's behavior can be controlled using the following configuration properties in `ozone-site.xml`.

| Property | Default Value | Description |
| ------------- | ------------- | ----------- |
| `hdds.datanode.disk.balancer.volume.density.threshold` | `10.0` | A percentage (0-100). A Datanode is considered balanced if, for each volume, its utilization differs from the average Datanode utilization by no more than this threshold. |
| `hdds.datanode.disk.balancer.max.disk.throughputInMBPerSec` | `10` | The maximum bandwidth (in MB/s) that the balancer can use for moving data, to avoid impacting client I/O. |
| `hdds.datanode.disk.balancer.parallel.thread` | `5` | The number of worker threads to use for moving containers in parallel. |
| `hdds.datanode.disk.balancer.service.interval` | `60s` | The time interval at which the Datanode DiskBalancer service checks for imbalance and updates its configuration. |
| `hdds.datanode.disk.balancer.stop.after.disk.even` | `true` | If true, the DiskBalancer will automatically stop its balancing activity once disks are considered balanced (i.e., all volume densities are within the threshold). |
| `hdds.datanode.disk.balancer.volume.choosing.policy` | `org.apache.hadoop.ozone.container.diskbalancer.policy.DefaultVolumeChoosingPolicy` | The policy class for selecting source and destination volumes for balancing. |
| `hdds.datanode.disk.balancer.container.choosing.policy` | `org.apache.hadoop.ozone.container.diskbalancer.policy.DefaultContainerChoosingPolicy` | The policy class for selecting which containers to move from a source volume to a destination volume. |
| `hdds.datanode.disk.balancer.service.timeout` | `300s` | Timeout for Datanode DiskBalancer service operations. |
| `hdds.datanode.disk.balancer.should.run.default` | `false` | If the balancer fails to read its persisted configuration, this value determines whether the service should run by default. |

diff --git a/hadoop-hdds/docs/content/feature/diskBalancer.png b/hadoop-hdds/docs/content/feature/diskBalancer.png
new file mode 100644
index 0000000000..5c146bf062
Binary files /dev/null and
b/hadoop-hdds/docs/content/feature/diskBalancer.png differ