khazhen created HDFS-17867:
------------------------------
Summary: Implement a new NetworkTopology that supports weighted
random choose
Key: HDFS-17867
URL: https://issues.apache.org/jira/browse/HDFS-17867
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: khazhen
h2. Background
In the current BlockPlacementPolicyDefault, each DN in the cluster is
selected with roughly equal probability.
However, in our cluster, there are various types of DataNode machines with
completely different hardware specifications.
For example, some machines have more hard drives, higher bandwidth network
cards, faster CPUs, etc., while some older machines
are the opposite. Their service capacity is much lower than other newer
machines (due to inferior hardware specifications).
Therefore, as the overall cluster load increases, these lower-performance
machines immediately become bottlenecks,
causing the entire cluster's performance to decline, or even affecting
availability (such as frequent slow nodes or
multiple PipelineRecovery failures), while high-performance machines have low
loads, resulting in significant resource waste.
The root cause of this problem is that we don't have a good method to
achieve load balancing between DN nodes.
Initially, our solution was to group DN machines according to their hardware
specifications, then expand them into
different clusters separately, and finally provide services externally using a
federation cluster based on RBF.
We indirectly achieved the purpose of controlling DN load balancing through
directory partitioning. This approach
has high operational costs, and it's difficult for humans to precisely divide
the directory tree.
h2. Solution
To better solve this problem, we implemented a NetworkTopology that can
select DNs based on weights.
We can configure a weight value for each DN similar to how we configure
racks. When choosing targets, it will sample
according to the configured weights. For clusters containing DNs with different
hardware specifications, introducing
this feature has several benefits:
# Better load balancing between DNs. High-performance machines can handle more
traffic, and the overall service
capacity of the cluster will be improved.
# Higher resource utilization.
# Reduced overhead from Balancer. Typically, higher-performance machines mean
more hard drives and larger
capacity. If we configure weights according to capacity ratios, the
amount of data that needs to be moved by
Balancer will be significantly reduced. (Of course, Balancer is still
needed for expansion scenarios.)
Our production cluster has many different types of hardware specifications
for DN machines, and some machines
can have capacities up to 10 times that of some older models. Additionally,
some machines are co-deployed with many
other computing services, causing them to immediately become slow nodes once
traffic increases. After introducing this
feature, we let independently deployed high-performance, large-capacity
machines handle more traffic, and both the
overall IO performance and availability of the cluster have been significantly
improved.
Our cluster's Hadoop version is still at 2.x, so we directly modified the
NetworkTopology class to implement this
feature. However, in the latest version, DFSNetworkTopology has been introduced
as the default implementation.
Therefore, I attempted to re-implement this feature based on
DFSNetworkTopology. I will introduce the details next.
h2. Implementation
Let's have a look at the chooseRandomWithStorageType method of
DFSNetworkTopology. Consider we have 3 dn in the cluster,
dn1(/r1), dn2(/r1), dn3(/r2). The topology tree looks like this:
/
/r1
/dn1
/dn2
/r2
/dn3
There are 3 core steps to choose a random dn from root scope:
1. compute num of available nodes under r1 and r2, which is [2, 1] in this
case.
2. perform a weighted random choose from [r1, r2] with weight [2, 1],
assume r1 is chosen
3. as r1 is a rack inner node, randomly choose a dn from its children list
[dn1, dn2]
The probability of each of these three dn being chosen is 1/3.
Now we want to introduce a weighted random choose from [dn1, dn2, dn3] with
weight [3, 1, 2]. A simple and straightforward
solution is to add virtual nodes to the topology tree, and the new topology
tree looks like this:
/
/r1
/dn1'
/dn1'
/dn1'
/dn2'
/r2
/dn3'
/dn3'
The probability of each of these virtual nodes being chosen is 1/6, and dn1
has 3 virtual nodes, so the probability of
choosing dn1 is 1/2, and 1/6, 1/3 for dn2 and dn3 respectively.
However, upon reviewing steps 1 through 3, we can see that step 1 and 2
only care about the number of data nodes under
inner node, this means that we don't need to really add virtual nodes to the
topology tree, instead, we can introduce a
new method getNodeCount(Node n), it accepts a node as input, and returns the
number of data nodes under n. In the old
DFSNetworkTopology class, it just returns the number of physical data nodes
under n. Then we can add a new subclass
of DFSNetworkTopology which overrides getNodeCount(Node n) to return the total
weight of all data nodes under n.
The step 3 needs to be modified as well, we should perform a weighted
random choose from child list rather than a simple random
choose.
h2. Difference with AvailableSpaceBlockPlacementPolicy
AvailableSpaceBlockPlacementPolicy is useful when we add new nodes to the
cluster, it makes the new added nodes
being chosen with a little high possibility than the old ones, and the cluster
will trend to be balanced after a period of time.
The real time load of newly added nodes won't change much.
This feature focuses on the real time load balancing between data nodes,
it's useful in the cluster that has many different
types of data nodes.
By the way, it is a very useful feature to make the weight of nodes
reconfigurable without restarting namenode.
It allows us to quickly adjust weights based on the actual load of the cluster.
I will introduce this feature in a
separate JIRA after this one is completed.
I have submitted a PR. More suggestions and discussions are welcomed.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]