khazhen created HDFS-17867:
------------------------------

             Summary: Implement a new NetworkTopology that supports weighted 
random choose
                 Key: HDFS-17867
                 URL: https://issues.apache.org/jira/browse/HDFS-17867
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: khazhen


h2. Background

    In the current BlockPlacementPolicyDefault, each DN in the cluster is selected with roughly equal probability. However, our cluster contains DataNode machines with very different hardware specifications. Some machines have more hard drives, higher-bandwidth network cards, faster CPUs, and so on, while some older machines are the opposite: their service capacity is much lower than that of the newer machines. As the overall cluster load increases, these lower-performance machines quickly become bottlenecks, degrading the performance of the whole cluster and even affecting availability (for example, frequent slow nodes or repeated PipelineRecovery failures), while the high-performance machines remain lightly loaded, wasting a significant amount of resources.

    The root cause of this problem is that we do not have a good way to balance load across DN nodes. Initially, our solution was to group DN machines by hardware specification, expand each group into a separate cluster, and then serve them externally as a federated cluster based on RBF. In effect, we controlled DN load balancing indirectly through directory partitioning. This approach has high operational cost, and it is hard to partition the directory tree precisely by hand.

h2. Solution

    To better solve this problem, we implemented a NetworkTopology that can select DNs based on weights.
    We can configure a weight value for each DN, similar to how we configure racks. When choosing targets, the topology samples DNs according to the configured weights. For clusters that contain DNs with different hardware specifications, introducing this feature has several benefits:

# Better load balancing between DNs. High-performance machines can handle more traffic, and the overall service capacity of the cluster is improved.
# Higher resource utilization.
# Reduced overhead for the Balancer. Typically, higher-performance machines mean more hard drives and larger capacity. If we configure weights according to capacity ratios, the amount of data that the Balancer needs to move is significantly reduced (the Balancer is still needed for expansion scenarios, of course). A small illustrative sketch of deriving weights from capacity follows this list.
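
    As a rough illustration of the capacity-ratio idea above (this is not code from the PR; the class name, the baseline value, and the capacities are made up for the example), per-DN weights could be derived like this:
{code:java}
import java.util.HashMap;
import java.util.Map;

public class CapacityWeightExample {
  public static void main(String[] args) {
    // Capacities in TB for three DNs (illustrative numbers only).
    Map<String, Long> capacityTb = new HashMap<>();
    capacityTb.put("dn1", 120L);
    capacityTb.put("dn2", 40L);
    capacityTb.put("dn3", 80L);

    long baseline = 40L; // smallest machine type acts as weight 1
    Map<String, Integer> weights = new HashMap<>();
    for (Map.Entry<String, Long> e : capacityTb.entrySet()) {
      // Round to the nearest integer weight, with a minimum of 1.
      int w = (int) Math.max(1L, Math.round(e.getValue() / (double) baseline));
      weights.put(e.getKey(), w);
    }
    System.out.println(weights); // e.g. dn1=3, dn2=1, dn3=2 (map order may vary)
  }
}
{code}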

    Our production cluster contains many different types of DN machines, and some machines have up to 10 times the capacity of the older models. In addition, some machines are co-deployed with many other computing services, so they become slow nodes as soon as traffic increases. After introducing this feature, the independently deployed, high-performance, large-capacity machines handle more of the traffic, and both the overall IO performance and the availability of the cluster have improved significantly.

    Our cluster is still running Hadoop 2.x, so we implemented this feature by modifying the NetworkTopology class directly. In the latest version, however, DFSNetworkTopology has been introduced as the default implementation, so I have attempted to re-implement the feature on top of DFSNetworkTopology. The details are described below.

h2. Implementation

    Let's look at the chooseRandomWithStorageType method of DFSNetworkTopology. Suppose we have 3 DNs in the cluster: dn1 (/r1), dn2 (/r1), dn3 (/r2). The topology tree looks like this:
    /
        /r1
            /dn1
            /dn2
        /r2
            /dn3
    There are 3 core steps to choose a random DN from the root scope:
    1. Compute the number of available nodes under r1 and r2, which is [2, 1] in this case.
    2. Perform a weighted random choice from [r1, r2] with weights [2, 1]; assume r1 is chosen.
    3. Since r1 is a rack (an inner node), choose a DN uniformly at random from its children [dn1, dn2].
    The probability of each of the three DNs being chosen is 1/3.
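
    As a rough illustration of step 2 (this is not the actual DFSNetworkTopology code; the class and method names below are made up), a weighted choice among the inner nodes can be made by drawing a uniform random number over the total count and walking the children:
{code:java}
import java.util.List;
import java.util.Random;

public class WeightedInnerNodeChoice {
  // Picks an index from 'counts' with probability proportional to counts[i],
  // the way step 2 above weights r1 and r2 by the number of available nodes
  // beneath them.
  static int chooseWeighted(List<Integer> counts, Random rand) {
    int total = 0;
    for (int c : counts) {
      total += c;
    }
    int r = rand.nextInt(total); // uniform in [0, total)
    for (int i = 0; i < counts.size(); i++) {
      r -= counts.get(i);
      if (r < 0) {
        return i;
      }
    }
    throw new IllegalStateException("unreachable for positive counts");
  }

  public static void main(String[] args) {
    // Counts under [r1, r2] from the example: [2, 1].
    List<Integer> counts = List.of(2, 1);
    int chosen = chooseWeighted(counts, new Random());
    System.out.println(chosen == 0 ? "r1" : "r2"); // r1 with probability 2/3
  }
}
{code}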

    Now we want to introduce a weighted random choice from [dn1, dn2, dn3] with weights [3, 1, 2]. A simple and straightforward solution is to add virtual nodes to the topology tree, so that the new topology tree looks like this:
    /
        /r1
            /dn1'
            /dn1'
            /dn1'
            /dn2'
        /r2
            /dn3'
            /dn3'
    Each virtual node is chosen with probability 1/6. Since dn1 has 3 virtual nodes, the probability of choosing dn1 is 3/6 = 1/2, while dn2 and dn3 are chosen with probability 1/6 and 2/6 = 1/3 respectively.
    However, reviewing steps 1 through 3, we can see that steps 1 and 2 only care about the number of data nodes under each inner node. This means we don't really need to add virtual nodes to the topology tree. Instead, we can introduce a new method getNodeCount(Node n) that accepts a node and returns the number of data nodes under n. In the base DFSNetworkTopology class it simply returns the number of physical data nodes under n; a new subclass of DFSNetworkTopology can then override getNodeCount(Node n) to return the total weight of all data nodes under n.
    Step 3 needs to be modified as well: we should perform a weighted random choice from the child list rather than a uniform random choice.
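
    A minimal, self-contained sketch of this idea (this is not the actual subclass from the PR; the TreeNode class, the weight map, and setWeight are made up for illustration, and only getNodeCount comes from the description above): getNodeCount returns the total weight under a node instead of the physical DN count, and the same weighted walk is applied at every level, including the leaf level of step 3:
{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class WeightedTopologySketch {
  // A toy stand-in for the topology tree: leaves are DNs, inner nodes are racks.
  static class TreeNode {
    final String name;
    final List<TreeNode> children = new ArrayList<>();
    TreeNode(String name) { this.name = name; }
    boolean isLeaf() { return children.isEmpty(); }
  }

  // Configured per-DN weights (hypothetical; e.g. derived from capacity ratios).
  private final Map<String, Integer> weights = new HashMap<>();
  private final Random rand = new Random();

  void setWeight(String dn, int w) { weights.put(dn, w); }

  // In the base topology this would return the number of data nodes under n;
  // here it returns the total weight of all data nodes under n instead.
  int getNodeCount(TreeNode n) {
    if (n.isLeaf()) {
      return weights.getOrDefault(n.name, 1);
    }
    int total = 0;
    for (TreeNode child : n.children) {
      total += getNodeCount(child);
    }
    return total;
  }

  // At every level, pick a child with probability proportional to its
  // (weighted) node count, until a leaf (a DN) is reached.
  TreeNode chooseRandom(TreeNode scope) {
    TreeNode current = scope;
    while (!current.isLeaf()) {
      int r = rand.nextInt(getNodeCount(current));
      for (TreeNode child : current.children) {
        r -= getNodeCount(child);
        if (r < 0) {
          current = child;
          break;
        }
      }
    }
    return current;
  }

  public static void main(String[] args) {
    WeightedTopologySketch topo = new WeightedTopologySketch();
    TreeNode root = new TreeNode("/");
    TreeNode r1 = new TreeNode("/r1");
    TreeNode r2 = new TreeNode("/r2");
    TreeNode dn1 = new TreeNode("dn1");
    TreeNode dn2 = new TreeNode("dn2");
    TreeNode dn3 = new TreeNode("dn3");
    root.children.add(r1);
    root.children.add(r2);
    r1.children.add(dn1);
    r1.children.add(dn2);
    r2.children.add(dn3);
    topo.setWeight("dn1", 3);
    topo.setWeight("dn2", 1);
    topo.setWeight("dn3", 2);
    // Expected long-run selection frequencies: dn1 ~ 1/2, dn2 ~ 1/6, dn3 ~ 1/3.
    System.out.println(topo.chooseRandom(root).name);
  }
}
{code}
    This keeps the topology tree unchanged and only replaces the counting and sampling logic, which is why no virtual nodes are needed.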

h2. Difference with AvailableSpaceBlockPlacementPolicy

    AvailableSpaceBlockPlacementPolicy is useful when we add new nodes to the cluster: it makes the newly added nodes slightly more likely to be chosen than the old ones, so the cluster tends to become balanced after a period of time. It does not change the real-time load of the newly added nodes very much.
    This feature focuses on real-time load balancing between data nodes, and it is useful in clusters that contain many different types of data nodes.

    By the way, making the node weights reconfigurable without restarting the NameNode is a very useful feature. It allows us to quickly adjust weights based on the actual load of the cluster. I will introduce that feature in a separate JIRA after this one is completed.
    I have submitted a PR. Suggestions and discussion are welcome.



