[ https://issues.apache.org/jira/browse/HDFS-16614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
JiangHua Zhu updated HDFS-16614:
--------------------------------
Affects Version/s: 2.9.2
(was: 3.3.0)
> Improve balancer operation strategy and performance
> ---------------------------------------------------
>
> Key: HDFS-16614
> URL: https://issues.apache.org/jira/browse/HDFS-16614
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: balancer & mover, namenode
> Affects Versions: 2.9.2
> Reporter: JiangHua Zhu
> Assignee: JiangHua Zhu
> Priority: Major
> Attachments: image-2022-06-02-13-18-33-213.png
>
>
> When the Balancer runs, it performs the following steps:
> 1. Obtain information about the available DataNodes from the NameNode.
> 2. Group the DataNodes by StorageType and calculate the average utilization
> of each group. Combined with the configured threshold, this yields four sets:
> overUtilized, aboveAvgUtilized, belowAvgUtilized, and underUtilized.
> 3. From these sets, determine the sources and targets of the data transfer:
> a source sends data and a target receives it.
> 4. Start the data transfers in parallel.
> These steps run iteratively. Throughout the process, a single threshold is
> applied to all StorageTypes. This is rather coarse, because it cannot
> distinguish between the StorageTypes supported by heterogeneous storage.
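The classification in step 2, with one threshold shared by every StorageType, can be sketched as follows. This is a minimal, hypothetical Java illustration, not the Balancer's actual code: the `Node` record, the `Category` enum, the unweighted mean, and the exact boundary comparisons (overUtilized above average plus threshold, underUtilized below average minus threshold) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BalancerClassificationSketch {

    // The four sets named in the issue (hypothetical enum, not Balancer code).
    enum Category { OVER_UTILIZED, ABOVE_AVG_UTILIZED, BELOW_AVG_UTILIZED, UNDER_UTILIZED }

    // Hypothetical per-node view: DataNode name, StorageType, utilization in percent.
    record Node(String name, String storageType, double utilization) {}

    // Assumed boundaries: "over" means more than avg + t, "under" means less than avg - t.
    static Category categorize(double utilization, double avg, double thresholdPct) {
        double diff = utilization - avg;
        if (diff > thresholdPct) {
            return Category.OVER_UTILIZED;
        } else if (diff > 0) {
            return Category.ABOVE_AVG_UTILIZED;
        } else if (diff >= -thresholdPct) {
            return Category.BELOW_AVG_UTILIZED;
        }
        return Category.UNDER_UTILIZED;
    }

    // Groups nodes by StorageType, averages each group, and classifies every node
    // against its group's average using ONE threshold shared by all StorageTypes.
    static Map<String, Map<Category, List<String>>> classify(List<Node> nodes, double thresholdPct) {
        Map<String, List<Node>> byType = new TreeMap<>();
        for (Node n : nodes) {
            byType.computeIfAbsent(n.storageType(), k -> new ArrayList<>()).add(n);
        }
        Map<String, Map<Category, List<String>>> result = new TreeMap<>();
        for (Map.Entry<String, List<Node>> group : byType.entrySet()) {
            // Unweighted mean of node utilization (a simplification for the sketch).
            double avg = group.getValue().stream()
                    .mapToDouble(Node::utilization).average().orElse(0.0);
            Map<Category, List<String>> sets = new EnumMap<>(Category.class);
            for (Category c : Category.values()) {
                sets.put(c, new ArrayList<>());
            }
            for (Node n : group.getValue()) {
                sets.get(categorize(n.utilization(), avg, thresholdPct)).add(n.name());
            }
            result.put(group.getKey(), sets);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
                new Node("dn1", "DISK", 95.0), new Node("dn2", "DISK", 82.0),
                new Node("dn3", "DISK", 75.0), new Node("dn4", "DISK", 56.0),
                new Node("dn5", "SSD", 60.0), new Node("dn6", "SSD", 40.0));
        // The same threshold (10) is applied to both DISK and SSD.
        classify(nodes, 10.0).forEach((type, sets) -> System.out.println(type + " -> " + sets));
    }
}
```

Because `classify` takes a single `thresholdPct`, the SSD group is judged with the same tolerance as the DISK group, which is the coarseness the issue points out.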
> We have a production cluster with more than 2000 nodes whose storage
> utilization is unbalanced, for example:
> !image-2022-06-02-13-18-33-213.png!
> Here, the average utilization of the cluster is 78%, but most nodes are
> between 85% and 90% utilized. When the balancer is started, we find that 85%
> of the nodes act as sources. We do not think this is reasonable: it consumes
> a large share of the cluster's network bandwidth, and placing effective limits
> on the Source set would help the cluster keep serving its normal workload.
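The figures above can be checked with a short worked calculation. It assumes the Balancer's default threshold of 10 percentage points, and assumes that a node is overUtilized when its utilization exceeds the average plus the threshold and aboveAvgUtilized when it is above the average but within the threshold.

```latex
\begin{aligned}
\bar{u} + t &= 78\% + 10\% = 88\% \\
\bar{u} < 85\% \le \bar{u} + t &\;\Rightarrow\; \text{aboveAvgUtilized} \\
90\% > \bar{u} + t &\;\Rightarrow\; \text{overUtilized}
\end{aligned}
```

Under these assumptions, every node in the 85% to 90% band is above the average, so if both sets can supply sources, all of those nodes are candidate sources, which matches the very large Source set described above.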
> We therefore propose the following changes:
> 1. While the balancer is running, it should report the threshold that applies
> to each StorageType, for example [[DISK, 10%], [SSD, 8%]...].
> 2. Support setting a separate threshold for each StorageType and balancing
> according to it.
> 3. Add an option that prevents nodes below the threshold from joining the
> Source set, so that the most highly utilized nodes can transfer data as soon
> as possible, which helps the cluster reach balance.
> 4. Add an option to leave unchanged a large group of DataNodes with similar
> utilization. For example, if 40% of the nodes in the cluster are between 75%
> and 80% utilized, these nodes should not join the Source set. This behavior
> must be enabled explicitly by the user at runtime.
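The four proposals can be illustrated with a minimal, hypothetical Java sketch. It assumes per-StorageType thresholds kept in a map (proposals 1 and 2), an `onlyAboveThreshold` switch for proposal 3, and a user-supplied utilization band for proposal 4. The names `describe`, `selectSources`, and `SourceOptions` are invented for illustration; they are not Balancer APIs or command-line options, and the average passed in is an illustrative value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

public class PerTypeThresholdSketch {

    // Hypothetical per-node view for one StorageType: name and utilization in percent.
    record Node(String name, double utilization) {}

    // Hypothetical runtime options (not real Balancer flags):
    // onlyAboveThreshold -> proposal 3; [keepLowPct, keepHighPct] -> proposal 4.
    // An empty band (low > high) disables proposal 4.
    record SourceOptions(boolean onlyAboveThreshold, double keepLowPct, double keepHighPct) {
        static SourceOptions unrestricted() {
            return new SourceOptions(false, Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY);
        }
    }

    // Proposal 1: render per-StorageType thresholds, e.g. [[DISK, 10.0%], [SSD, 8.0%]].
    static String describe(Map<String, Double> thresholds) {
        StringJoiner sj = new StringJoiner(", ", "[", "]");
        thresholds.forEach((type, t) -> sj.add("[" + type + ", " + t + "%]"));
        return sj.toString();
    }

    // Proposals 2-4: choose sources for one StorageType using that type's own threshold.
    static List<String> selectSources(List<Node> nodes, double avg, double threshold,
                                      SourceOptions opts) {
        List<String> sources = new ArrayList<>();
        for (Node n : nodes) {
            double u = n.utilization();
            if (u <= avg) {
                continue; // at or below average: never a source in this sketch
            }
            if (opts.onlyAboveThreshold() && u <= avg + threshold) {
                continue; // proposal 3: skip nodes that do not exceed avg + threshold
            }
            if (u >= opts.keepLowPct() && u <= opts.keepHighPct()) {
                continue; // proposal 4: leave the user-specified band unchanged
            }
            sources.add(n.name());
        }
        return sources;
    }

    public static void main(String[] args) {
        Map<String, Double> thresholds = new LinkedHashMap<>();
        thresholds.put("DISK", 10.0);
        thresholds.put("SSD", 8.0);
        System.out.println(describe(thresholds)); // prints [[DISK, 10.0%], [SSD, 8.0%]]

        List<Node> disk = List.of(new Node("dn1", 95.0), new Node("dn2", 86.0),
                new Node("dn3", 79.0), new Node("dn4", 60.0));
        double avg = 78.0;                 // illustrative DISK average
        double t = thresholds.get("DISK"); // per-type threshold (proposal 2)
        System.out.println(selectSources(disk, avg, t, SourceOptions.unrestricted())); // [dn1, dn2, dn3]
        System.out.println(selectSources(disk, avg, t,
                new SourceOptions(true, Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY))); // [dn1]
        System.out.println(selectSources(disk, avg, t, new SourceOptions(false, 75.0, 80.0))); // [dn1, dn2]
    }
}
```

With no options set, every above-average node is a candidate source; enabling proposal 3 keeps only nodes above 88%, and the 75% to 80% band from proposal 4 removes `dn3` from the Source set.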
--
This message was sent by Atlassian Jira
(v8.20.7#820007)