JiangHua Zhu created HDFS-16614:
-----------------------------------
Summary: Improve balancer operation strategy and performance
Key: HDFS-16614
URL: https://issues.apache.org/jira/browse/HDFS-16614
Project: Hadoop HDFS
Issue Type: Improvement
Components: balancer & mover, namenode
Affects Versions: 3.3.0
Reporter: JiangHua Zhu
Attachments: image-2022-06-02-13-18-33-213.png
When the Balancer runs, it works in the following order:
1. Obtain information about the available DataNodes from the NameNode.
2. Calculate the average utilization for each StorageType and classify the
storages against it. Combined with the configured threshold, this yields four
sets: overUtilized, aboveAvgUtilized, belowAvgUtilized, and underUtilized.
3. Pair sources with targets for the data transfer. A source is the sending
end, and a target is the receiving end.
4. Start the data transfers in parallel.
These steps run iteratively. Throughout, a single threshold is applied to all
StorageTypes. This seems a bit coarse: with the heterogeneous storage that is
currently supported, the individual StorageTypes cannot be treated
differently.
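
The classification in step 2 can be sketched as follows. This is a simplified
illustration of the idea, not the Balancer's actual code: the function name
and the tuple layout are assumptions made for the example, and a single
threshold (in percentage points) is applied to every StorageType, as described
above.

```python
from collections import defaultdict


def classify(storages, threshold):
    """Classify storages against the per-StorageType average utilization.

    storages:  list of (node, storage_type, used, capacity) tuples
               (hypothetical layout, for illustration only).
    threshold: allowed deviation from the average, in percentage points.
    """
    used = defaultdict(int)
    capacity = defaultdict(int)
    for _, stype, u, c in storages:
        used[stype] += u
        capacity[stype] += c

    # Average utilization (percent) per StorageType.
    avg = {t: 100.0 * used[t] / capacity[t] for t in capacity if capacity[t] > 0}

    sets = {
        "overUtilized": [],
        "aboveAvgUtilized": [],
        "belowAvgUtilized": [],
        "underUtilized": [],
    }
    for node, stype, u, c in storages:
        if c <= 0:
            continue  # skip storages that report no capacity
        diff = 100.0 * u / c - avg[stype]
        within = abs(diff) <= threshold
        if diff > 0:
            name = "aboveAvgUtilized" if within else "overUtilized"
        else:
            name = "belowAvgUtilized" if within else "underUtilized"
        sets[name].append((node, stype))
    return avg, sets
```

Because the same threshold is used for every StorageType in this sketch, a
value that suits DISK is also forced onto SSD, which is the limitation
described above.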
We have a production cluster with more than 2000 nodes whose node storage is
imbalanced. For example:
!image-2022-06-02-13-18-33-213.png!
Here, the average utilization of the cluster is 78%, but the utilization of
most nodes is between 85% and 90%. When the balancer is turned on, we find that
85% of the nodes are working as sources. We do not think this is reasonable,
because it occupies more network resources in the cluster. Restricting which
nodes may act as sources would benefit the normal work of the cluster.
So here are some changes to make:
1. When the balancer is running, it should print the threshold in effect for
each StorageType. For example [[DISK, 10%], [SSD, 8%]...]
2. Support setting a threshold per StorageType and balancing according to it.
3. Add an option to prohibit nodes below the threshold from joining the Source
set. This allows nodes with high utilization to transfer data as soon as
possible, which is good for balance.
4. Add an option to leave nodes unchanged when a large share of the DataNodes
in the cluster have similar utilization. For example, if 40% of the nodes in
the cluster are 75% to 80% utilized, these nodes should not join the Source
set. This behavior must be explicitly enabled by the user at runtime.
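
A minimal sketch of how proposals 2 to 4 might fit together is shown below.
Everything in it is hypothetical: the "DISK=10,SSD=8" syntax, the function
names, and the only_over_threshold and protected_range parameters are
assumptions made for illustration, not existing Balancer options.

```python
def parse_thresholds(spec, default=10.0):
    """Parse a spec such as 'DISK=10,SSD=8' (hypothetical syntax) into a dict."""
    thresholds = {}
    for item in filter(None, (part.strip() for part in spec.split(","))):
        stype, _, value = item.partition("=")
        # A StorageType given without a value falls back to the default.
        thresholds[stype.strip().upper()] = float(value) if value.strip() else default
    return thresholds


def choose_sources(utilization, avg, thresholds, default=10.0,
                   only_over_threshold=False, protected_range=None):
    """Pick the storages that are allowed to act as sources.

    utilization:         {(node, storage_type): percent used}
    avg:                 {storage_type: average percent used}
    thresholds:          {storage_type: threshold in percentage points}
    only_over_threshold: proposal 3, drop storages above the average but
                         still within the threshold.
    protected_range:     proposal 4, a user-supplied (low, high) utilization
                         range whose storages never become sources.
    """
    sources = []
    for (node, stype), pct in sorted(utilization.items()):
        diff = pct - avg[stype]
        if diff <= 0:
            continue  # at or below the average: never a source
        if only_over_threshold and diff <= thresholds.get(stype, default):
            continue
        if protected_range and protected_range[0] <= pct <= protected_range[1]:
            continue
        sources.append((node, stype))
    return sources
```

With both options left at their defaults, every storage above its
StorageType average remains eligible as a source. Enabling either option
only shrinks the Source set.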