This is an automated email from the ASF dual-hosted git repository.

weichiu pushed a commit to branch HDDS-5713
in repository https://gitbox.apache.org/repos/asf/ozone.git
The following commit(s) were added to refs/heads/HDDS-5713 by this push:
     new 2334f41049 HDDS-12598. [DiskBalancer] Add design and feature document (#8837)

2334f41049 is described below

commit 2334f4104972084cf8beed22f47d744e468abf48
Author: Gargi Jaiswal <134698352+gargi-jai...@users.noreply.github.com>
AuthorDate: Wed Jul 23 12:59:57 2025 +0530

    HDDS-12598. [DiskBalancer] Add design and feature document (#8837)
---
 hadoop-hdds/docs/content/design/diskbalancer.md   | 136 +++++++++++++++++++++
 hadoop-hdds/docs/content/feature/DiskBalancer.md  | 122 ++++++++++++++++++
 .../docs/content/feature/DiskBalancer.zh.md       | 116 ++++++++++++++++++
 hadoop-hdds/docs/content/feature/diskBalancer.png | Bin 0 -> 116124 bytes
 4 files changed, 374 insertions(+)

diff --git a/hadoop-hdds/docs/content/design/diskbalancer.md b/hadoop-hdds/docs/content/design/diskbalancer.md
new file mode 100644
index 0000000000..5121631c03
--- /dev/null
+++ b/hadoop-hdds/docs/content/design/diskbalancer.md
@@ -0,0 +1,136 @@
---
title: "DiskBalancer for Datanode"
summary: "DiskBalancer is a feature to evenly distribute data across all disks within a Datanode for even disk utilization."
date: 2025-07-21
jira: HDDS-5713
status: implementing
author: Janus Chow, Sammi Chen, Gargi Jaiswal, Stephen O'Donnell
---
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
# [HDDS-5713](https://issues.apache.org/jira/browse/HDDS-5713) DiskBalancer for Datanode (implementing)

## Background
**Apache Ozone** distributes all containers evenly
across the multiple disks on each Datanode. This initial spread
ensures that I/O load is balanced from the start. However,
over the operational lifetime of a cluster, **disk imbalance** can
occur for the following reasons:
- **Adding new disks** to expand Datanode storage space.
- **Replacing old broken disks** with new disks.
- Massive **block** or **replica deletions**.

This uneven utilization of disks can create performance bottlenecks, as
**over-utilized disks** become **hotspots** that limit the overall throughput of the
Datanode. The **DiskBalancer** feature is therefore introduced to
ensure even data distribution across the disks within a Datanode.

## Proposed Solution
The DiskBalancer is a feature that evenly distributes data across
the different disks of a Datanode.

It detects imbalance within a Datanode using a metric borrowed from the
[HDFS DiskBalancer](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html),
called **Volume Data Density**. This metric is calculated for
each disk using the following formula:

```
AverageUtilization = TotalUsed / TotalCapacity
VolumeUtilization = diskUsedSpace / diskCapacity
VolumeDataDensity = | VolumeUtilization - AverageUtilization |
```
Here, **VolumeUtilization** is each disk's individual utilization and
**AverageUtilization** is the ideal utilization for all disks to maintain
evenness.

A disk is considered a candidate for balancing if its `VolumeDataDensity` exceeds a configurable
`threshold`. The DiskBalancer then moves containers from the most
utilized disk to the least utilized disk. The DiskBalancer can be triggered manually by **CLI commands**.
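The density calculation above can be sketched in a few lines. This is an illustrative Python version, not the Datanode's actual Java implementation; the `volume_densities` function name and the example volume sizes are hypothetical:

```python
def volume_densities(volumes, threshold=10.0):
    """Compute VolumeDataDensity per disk and flag balancing candidates.

    volumes maps a disk name to (used_bytes, capacity_bytes); threshold is
    a percentage, mirroring the default of
    hdds.datanode.disk.balancer.volume.density.threshold.
    """
    total_used = sum(used for used, _ in volumes.values())
    total_capacity = sum(cap for _, cap in volumes.values())
    average_utilization = total_used / total_capacity
    densities = {}
    for name, (used, cap) in volumes.items():
        volume_utilization = used / cap
        density = abs(volume_utilization - average_utilization)
        # A disk is a balancing candidate when its density, expressed as a
        # percentage, exceeds the threshold.
        densities[name] = (density, density * 100 > threshold)
    return densities


# Example: one nearly full disk and one nearly empty disk of equal size.
# AverageUtilization = 0.5, so each disk has a density of about 0.4,
# which exceeds the default 10% threshold.
print(volume_densities({"disk1": (900, 1000), "disk2": (100, 1000)}))
```

Note that the density is symmetric: both the over-utilized and the under-utilized disk exceed the threshold, which is what lets the balancer pick a source/destination pair.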

## High-Level DiskBalancer Implementation

This design consists of three parts:

**Client & SCM:**

Administrators use the `ozone admin datanode diskbalancer` CLI to manage and monitor the feature.
* Clients control the DiskBalancer job by sending requests to SCM (start,
stop, update configuration) and can query the DiskBalancer status.
* Clients get the storageReport from SCM to decide which Datanode to balance.

**SCM & DN:**

SCM acts as a **control plane** and information hub but remains **stateless**
regarding the balancing process.
* SCM retrieves the storageReport and balance status from each DN via heartbeat.

**DN:**

All balancing operations are performed on the Datanodes.

A daemon thread, the **Scheduler**, runs periodically on each Datanode.
1. It calculates the `VolumeDataDensity` for all volumes.
2. If an imbalance is detected (i.e., density > threshold), it moves a set of closed containers
from the most over-utilized disk (source) to the least utilized disk (destination).
3. The scheduler dispatches these move tasks to a pool of **Worker** threads for parallel execution.

## Container Move Process

Suppose we are moving container C1 (in **CLOSED** state) from source disk D1 to destination disk D2:
1. A temporary copy, `Temp C1-CLOSED`, is created in the `temp directory` of the destination disk D2.
2. `Temp C1-CLOSED` is transitioned to the `Temp C1-RECOVERING` state. This **Temp C1-RECOVERING** container is then
atomically moved to the **final destination** directory of D2 as `C1-RECOVERING`.
3. A **new container** import is initiated for the `C1-RECOVERING` container.
4. Once the import succeeds, all metadata updates are performed for the new container on D2.
5. Finally, the original container `C1-CLOSED` on D1 is deleted.
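The five-step sequence above can be simulated with a toy state model. The sketch below is illustrative only (a hypothetical dict-based layout standing in for the real container storage code); its point is the ordering guarantee: the original replica is deleted only after the import on the destination has fully succeeded.

```python
def move_container(cluster, container, src, dst):
    """Simulate the DiskBalancer container move protocol.

    cluster maps a disk name to {container name -> state}; a "tmp/" key
    prefix stands in for the destination disk's temp directory.
    """
    assert cluster[src][container] == "CLOSED", "only CLOSED containers move"
    tmp = "tmp/" + container
    # (1) Temporary copy in the destination's temp directory: Temp C1-CLOSED.
    cluster[dst][tmp] = "CLOSED"
    # (2) Transition the copy to Temp C1-RECOVERING, then atomically move it
    #     into the final destination directory as C1-RECOVERING.
    cluster[dst][tmp] = "RECOVERING"
    cluster[dst][container] = cluster[dst].pop(tmp)
    # (3)-(4) Import the new container and complete its metadata updates,
    #         leaving it CLOSED on the destination disk.
    cluster[dst][container] = "CLOSED"
    # (5) Only now is the original replica on the source disk deleted.
    del cluster[src][container]


cluster = {"D1": {"C1": "CLOSED"}, "D2": {}}
move_container(cluster, "C1", "D1", "D2")
print(cluster)  # → {'D1': {}, 'D2': {'C1': 'CLOSED'}}
```

Because the rename in step (2) is atomic and the source copy survives until step (5), a crash at any intermediate point leaves at least one complete replica of the container on disk.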

```
D1 ----> C1-CLOSED --- (5) ---> C1-DELETED
             |
             |
            (1)
             |
D2 ----> Temp C1-CLOSED --- (2) ---> Temp C1-RECOVERING --- (3) ---> C1-RECOVERING --- (4) ---> C1-CLOSED
```
## DiskBalancing Policies

By default, the DiskBalancer uses specific policies to decide which disks to balance and which containers to move. These
are configurable, but the default implementations provide robust and safe behavior.

* **`DefaultVolumeChoosingPolicy`**: The default policy for selecting the source and destination volumes. It
identifies the most over-utilized volume as the source and the most under-utilized volume as the destination by comparing
each volume's utilization against the Datanode's average. The calculation accounts for data that is
already in the process of being moved, so decisions are based on the future state of the volumes.

* **`DefaultContainerChoosingPolicy`**: The default policy for selecting which container to move from a source
volume. It iterates through the containers on the source disk and picks the first one that is in a **CLOSED** state
and is not already being moved by another balancing operation. To optimize performance and avoid re-scanning the same
containers repeatedly, it caches the list of containers for each volume; a cached list expires one hour after it was
last used, or when its container iterator has been fully consumed and invalidated.

## DiskBalancer Metrics

The DiskBalancer service exposes JMX metrics on each Datanode for real-time monitoring. These metrics provide insights
into the balancer's activity, progress, and overall health.

| DiskBalancer Service Metrics | Description |
|------------------------------------------|--------------------------------------------------------------------------------------------------------------------|
| `SuccessCount` | The number of successful balance jobs. |
| `SuccessBytes` | The total number of bytes moved by successful balance jobs. |
| `FailureCount` | The number of failed balance jobs. |
| `moveSuccessTime` | The time spent on successful container moves. |
| `moveFailureTime` | The time spent on failed container moves. |
| `runningLoopCount` | The total number of times the balancer's main loop has run. |
| `idleLoopNoAvailableVolumePairCount` | The number of loops where balancing did not run because no suitable source/destination volume pair could be found. |
| `idleLoopExceedsBandwidthCount` | The number of loops where balancing did not run due to bandwidth limits. |

diff --git a/hadoop-hdds/docs/content/feature/DiskBalancer.md b/hadoop-hdds/docs/content/feature/DiskBalancer.md
new file mode 100644
index 0000000000..6ef86ec965
--- /dev/null
+++ b/hadoop-hdds/docs/content/feature/DiskBalancer.md
@@ -0,0 +1,122 @@
---
title: "DiskBalancer"
weight: 1
menu:
  main:
    parent: Features
summary: DiskBalancer for Datanodes.
---
<!---
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->

## Overview
**Apache Ozone** distributes containers evenly across the multiple disks on each Datanode.
This initial spread ensures that I/O load is balanced from the start.
However, over the operational lifetime of a
cluster, **disk imbalance** can occur for the following reasons:
- **Adding new disks** to expand Datanode storage space.
- **Replacing old broken disks** with new disks.
- Massive **block** or **replica deletions**.

This uneven utilization of disks can create performance bottlenecks, as **over-utilized disks** become **hotspots**
that limit the overall throughput of the Datanode. The **DiskBalancer** feature is therefore introduced to
ensure even data distribution across the disks within a Datanode.

A disk is considered a candidate for balancing if its
`VolumeDataDensity` exceeds a configurable `threshold`. The DiskBalancer can be triggered manually by **CLI commands**.

## Command Line Usage
The DiskBalancer is managed through the `ozone admin datanode diskbalancer` command.

### **Start DiskBalancer**
To start the DiskBalancer on all Datanodes with default configurations:

```shell
ozone admin datanode diskbalancer start -a
```

You can also start the DiskBalancer with specific options:

```shell
ozone admin datanode diskbalancer start [options]
```

### **Update Configurations**
To update DiskBalancer configurations, use the following command:

```shell
ozone admin datanode diskbalancer update [options]
```
**Options include:**

| Options | Description |
|---------------------------------------|-------------------------------------------------------------------------------------------------------|
| `-t, --threshold` | Percentage deviation from the average disk utilization after which a Datanode will be rebalanced. |
| `-b, --bandwithInMB` | Maximum bandwidth for the DiskBalancer per second. |
| `-p, --parallelThread` | Maximum number of parallel threads for the DiskBalancer. |
| `-s, --stop-after-disk-even` | Stop the DiskBalancer automatically once disk utilization is even. |
| `-a, --all` | Run commands on all Datanodes. |
| `-d, --datanodes` | Run commands on specific Datanodes. |

### **Stop DiskBalancer**
To stop the DiskBalancer on all Datanodes:

```shell
ozone admin datanode diskbalancer stop -a
```
You can also stop the DiskBalancer on specific Datanodes:

```shell
ozone admin datanode diskbalancer stop -d <datanode1>
```
### **DiskBalancer Status**
To check the status of the DiskBalancer on all Datanodes:

```shell
ozone admin datanode diskbalancer status
```
You can also check the status of the DiskBalancer on specific Datanodes:

```shell
ozone admin datanode diskbalancer status -d <datanode1>
```
### **DiskBalancer Report**
To get the **volumeDataDensity** report for the top **N** Datanodes (displayed in descending order; N defaults to 25 if not specified):

```shell
ozone admin datanode diskbalancer report --count <N>
```

## **DiskBalancer Configurations**

The DiskBalancer's behavior can be controlled using the following configuration properties in `ozone-site.xml`.

| Property | Default Value | Description |
| ------------- | ------------- | ----------- |
| `hdds.datanode.disk.balancer.volume.density.threshold` | `10.0` | A percentage (0-100). A Datanode is considered balanced if, for each volume, its utilization differs from the average Datanode utilization by no more than this threshold. |
| `hdds.datanode.disk.balancer.max.disk.throughputInMBPerSec` | `10` | The maximum bandwidth (in MB/s) that the balancer can use for moving data, to avoid impacting client I/O. |
| `hdds.datanode.disk.balancer.parallel.thread` | `5` | The number of worker threads to use for moving containers in parallel. |
| `hdds.datanode.disk.balancer.service.interval` | `60s` | The time interval at which the Datanode DiskBalancer service checks for imbalance and updates its configuration. |
| `hdds.datanode.disk.balancer.stop.after.disk.even` | `true` | If true, the DiskBalancer will automatically stop its balancing activity once disks are considered balanced (i.e., all volume densities are within the threshold). |
| `hdds.datanode.disk.balancer.volume.choosing.policy` | `org.apache.hadoop.ozone.container.diskbalancer.policy.DefaultVolumeChoosingPolicy` | The policy class for selecting source and destination volumes for balancing. |
| `hdds.datanode.disk.balancer.container.choosing.policy` | `org.apache.hadoop.ozone.container.diskbalancer.policy.DefaultContainerChoosingPolicy` | The policy class for selecting which containers to move from a source volume to a destination volume. |
| `hdds.datanode.disk.balancer.service.timeout` | `300s` | Timeout for Datanode DiskBalancer service operations. |
| `hdds.datanode.disk.balancer.should.run.default` | `false` | If the balancer fails to read its persisted configuration, this value determines whether the service should run by default. |

diff --git a/hadoop-hdds/docs/content/feature/DiskBalancer.zh.md b/hadoop-hdds/docs/content/feature/DiskBalancer.zh.md
new file mode 100644
index 0000000000..a44fb07cd7
--- /dev/null
+++ b/hadoop-hdds/docs/content/feature/DiskBalancer.zh.md
@@ -0,0 +1,116 @@
---
title: "DiskBalancer"
weight: 1
menu:
  main:
    parent: Features
summary: DiskBalancer for Datanodes.
---
<!---
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.
You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->

## Overview
**Apache Ozone** distributes containers evenly across the multiple disks on each Datanode. This initial spread ensures
that I/O load is balanced from the start. However, over the operational lifetime of a cluster, **disk imbalance** can
occur for the following reasons:
- **Adding new disks** to expand Datanode storage space.
- **Replacing old broken disks** with new disks.
- Massive **block** or **replica deletions**.

This uneven utilization of disks can create performance bottlenecks, as **over-utilized disks** become **hotspots**
that limit the overall throughput of the Datanode. The **DiskBalancer** feature is therefore introduced to ensure
even data distribution across the disks within a Datanode.

A disk is considered a candidate for balancing if its `VolumeDataDensity` exceeds a configurable `threshold`.
The DiskBalancer can be triggered manually by **CLI commands**.

## Command Line Usage
The DiskBalancer is managed through the `ozone admin datanode diskbalancer` command.

### **Start DiskBalancer**
To start the DiskBalancer on all Datanodes with default configurations:

```shell
ozone admin datanode diskbalancer start -a
```

You can also start the DiskBalancer with specific options:
```shell
ozone admin datanode diskbalancer start [options]
```

### **Update Configurations**
To update DiskBalancer configurations, use the following command:

```shell
ozone admin datanode diskbalancer update [options]
```
**Options include:**

| Options | Description |
|------------------------------|-------------------------------------------------------------------------------------------------------|
| `-t, --threshold` | Percentage deviation from the average disk utilization after which a Datanode will be rebalanced. |
| `-b, --bandwithInMB` | Maximum bandwidth for the DiskBalancer per second. |
| `-p, --parallelThread` | Maximum number of parallel threads for the DiskBalancer. |
| `-s, --stop-after-disk-even` | Stop the DiskBalancer automatically once disk utilization is even. |
| `-a, --all` | Run commands on all Datanodes. |
| `-d, --datanodes` | Run commands on specific Datanodes. |

### **Stop DiskBalancer**
To stop the DiskBalancer on all Datanodes:

```shell
ozone admin datanode diskbalancer stop -a
```
You can also stop the DiskBalancer on specific Datanodes:

```shell
ozone admin datanode diskbalancer stop -d <datanode1>
```
### **DiskBalancer Status**
To check the status of the DiskBalancer on all Datanodes:

```shell
ozone admin datanode diskbalancer status
```
You can also check the status of the DiskBalancer on specific Datanodes:
```shell
ozone admin datanode diskbalancer status -d <datanode1>
```
### **DiskBalancer Report**
To get the **volumeDataDensity** report for the top **N** Datanodes (displayed in descending order; N defaults to 25 if not specified):

```shell
ozone admin datanode diskbalancer report --count <N>
```

## DiskBalancer Configurations

The DiskBalancer's behavior can be controlled using the following configuration properties in `ozone-site.xml`.

| Property | Default Value | Description |
| ------------- | ------------- | ----------- |
| `hdds.datanode.disk.balancer.volume.density.threshold` | `10.0` | A percentage (0-100). A Datanode is considered balanced if, for each volume, its utilization differs from the average Datanode utilization by no more than this threshold. |
| `hdds.datanode.disk.balancer.max.disk.throughputInMBPerSec` | `10` | The maximum bandwidth (in MB/s) that the balancer can use for moving data, to avoid impacting client I/O. |
| `hdds.datanode.disk.balancer.parallel.thread` | `5` | The number of worker threads to use for moving containers in parallel. |
| `hdds.datanode.disk.balancer.service.interval` | `60s` | The time interval at which the Datanode DiskBalancer service checks for imbalance and updates its configuration. |
| `hdds.datanode.disk.balancer.stop.after.disk.even` | `true` | If true, the DiskBalancer will automatically stop its balancing activity once disks are considered balanced (i.e., all volume densities are within the threshold). |
| `hdds.datanode.disk.balancer.volume.choosing.policy` | `org.apache.hadoop.ozone.container.diskbalancer.policy.DefaultVolumeChoosingPolicy` | The policy class for selecting source and destination volumes for balancing. |
| `hdds.datanode.disk.balancer.container.choosing.policy` | `org.apache.hadoop.ozone.container.diskbalancer.policy.DefaultContainerChoosingPolicy` | The policy class for selecting which containers to move from a source volume to a destination volume. |
| `hdds.datanode.disk.balancer.service.timeout` | `300s` | Timeout for Datanode DiskBalancer service operations. |
| `hdds.datanode.disk.balancer.should.run.default` | `false` | If the balancer fails to read its persisted configuration, this value determines whether the service should run by default. |

diff --git a/hadoop-hdds/docs/content/feature/diskBalancer.png b/hadoop-hdds/docs/content/feature/diskBalancer.png
new file mode 100644
index 0000000000..5c146bf062
Binary files /dev/null and
b/hadoop-hdds/docs/content/feature/diskBalancer.png differ