This is an automated email from the ASF dual-hosted git repository.
kwang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/pulsar.git
The following commit(s) were added to refs/heads/master by this push:
new 4ac9bc42f22 [improve] [pip] PIP-364: Introduce a new load balance
algorithm AvgShedder (#22946)
4ac9bc42f22 is described below
commit 4ac9bc42f22f8163f59273a0b4ffc46cf3cffdea
Author: Wenzhi Feng <[email protected]>
AuthorDate: Thu Jun 27 16:57:04 2024 +0800
[improve] [pip] PIP-364: Introduce a new load balance algorithm AvgShedder
(#22946)
---
pip/pip-364.md | 476 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 476 insertions(+)
diff --git a/pip/pip-364.md b/pip/pip-364.md
new file mode 100644
index 00000000000..c589b3b47fc
--- /dev/null
+++ b/pip/pip-364.md
@@ -0,0 +1,476 @@
+
+# PIP-364: Introduce a new load balance algorithm AvgShedder
+
+# Background knowledge
+
+Pulsar has two load balance interfaces:
+- `LoadSheddingStrategy` is an unloading strategy that identifies high load
brokers and unloads some of the bundles they carry to reduce the load.
+- `ModularLoadManagerStrategy` is a placement strategy responsible for
assigning bundles to brokers.
+
+## LoadSheddingStrategy
+There are three available algorithms: `ThresholdShedder`, `OverloadShedder`,
`UniformLoadShedder`.
+
+### ThresholdShedder
+`ThresholdShedder` uses the following method to calculate the maximum resource
utilization rate for each broker,
+which includes CPU, direct memory, bandwidth in, and bandwidth out.
+```
+ public double getMaxResourceUsageWithWeight(final double cpuWeight,
+ final double
directMemoryWeight, final double bandwidthInWeight,
+ final double
bandwidthOutWeight) {
+ return max(cpu.percentUsage() * cpuWeight,
+ directMemory.percentUsage() * directMemoryWeight,
bandwidthIn.percentUsage() * bandwidthInWeight,
+ bandwidthOut.percentUsage() * bandwidthOutWeight) / 100;
+ }
+```
+
+After calculating the maximum resource utilization rate for each broker, a
historical weight algorithm will
+also be executed to obtain the final score.
+```
+historyUsage = historyUsage == null ? resourceUsage : historyUsage *
historyPercentage + (1 - historyPercentage) * resourceUsage;
+```
+The historyPercentage is determined by configuring the
`loadBalancerHistoryResourcePercentage`.
+The default value is 0.9, which means that the last calculated score accounts
for 90%,
+while the current calculated score only accounts for 10%.
+
+The introduction of this historical weight algorithm is to avoid bundle
switching caused by
+short-term abnormal load increase or decrease, but in fact, this algorithm
will introduce some
+serious problems, which will be explained in detail later.
+
+Next, calculate the average score of all brokers in the entire cluster:
`avgUsage=totalUsage/totalBrokers`.
+When the score of any broker exceeds a certain threshold of avgUsage, it is
determined that the broker is overloaded.
+The threshold is determined by the configuration
`loadBalancerBrokerThresholdShedderPercentage`, with a default value of 10.
+
+
+### OverloadShedder
+`OverloadShedder` use the same method `getMaxResourceUsageWithWeight` to
calculate the maximum resource utilization rate for each broker.
+The difference is that `OverloadShedder` will not use the historical weight
algorithm to calculate the final score,
+the final score is the current maximum resource utilization rate of the broker.
+
+After obtaining the load score for each broker, compare it with the
`loadBalancerBrokerOverloadedThresholdPercentage`.
+If the threshold is exceeded, it is considered overloaded, with a default
value of 85%.
+
+This algorithm is relatively simple, but there are many serious corner cases,
so it is not recommended to use `OverloadShedder`.
+Here are two cases:
+- When the load on each broker in the cluster reaches the threshold, the
bundle unload will continue to be executed,
+ but it will only switch from one overloaded broker to another, which is
meaningless.
+- If there are no broker whose load reaches the threshold, adding new brokers
will not balance the traffic to the new added brokers.
+The impact of these two points is quite serious, so we won't talk about it
next.
+
+
+### UniformLoadShedder
+`UniformLoadShedder` will first calculate the maximum and minimum message
rates, as well as the maximum and minimum
+traffic throughput and corresponding broker. Then calculate the maximum and
minimum difference, with two thresholds
+corresponding to message rate and throughput size, respectively.
+
+- loadBalancerMsgRateDifferenceShedderThreshold
+
+The message rate percentage threshold between the highest and lowest loaded
brokers, with a default value of 50,
+can trigger bundle unload when the maximum message rate is 1.5 times the
minimum message rate.
+For example, broker 1 with 50K msgRate and broker 2 with 30K msgRate will have
a (50-30)/30=66%>50% difference in msgRate,
+and the load balancer can unload the bundle from broker 1 to broker 2.
+
+- loadBalancerMsgThroughputMultiplierDifferenceShedderThreshold
+
+The threshold for the message throughput multiplier between the highest and
lowest loaded brokers,
+with a default value of 4, can trigger bundle unload when the maximum
throughput is 4 times the minimum throughput.
+For example, if the msgRate of broker 1 is 450MB, broker 2 is 100MB, and the
difference in msgThrough
+is 450/100=4.5>4 times, then the load balancer can unload the bundle from
broker 1 to broker 2.
+
+
+After introducing the algorithm of `UniformLoadShedder`, we can clearly obtain
the following information:
+#### load jitter
+`UniformLoadShedder` does not have the logic to handle load jitter. For
example,
+when the traffic suddenly increases or decreases. This load data point is
adopted, triggering a bundle unload.
+However, the traffic of this topic will soon return to normal, so it is very
likely to trigger a bundle unload again.
+This type of bundle unload should be avoided. This kind of scenario is very
common, actually.
+
+#### heterogeneous environment
+`UniformLoadShedder` does not rely on indicators such as CPU usage and network
card usage to determine high load
+and low load brokers, but rather determines them based on message rate and
traffic throughput size,
+while `ThresholdShedder` and `OverloadShedder` rely on machine resource
indicators such as CPU usage to determine.
+If the cluster is heterogeneous, such as different machines with different
hardware configurations,
+or if there are other processes sharing resources on the machine where the
broker is located,
+`UniformLoadShedder` is likely to misjudge high and low load brokers, thereby
migrating the load from high-performance
+but low load brokers to low-performance but high load brokers.
+Therefore, it is not recommended for users to use `UniformLoadShedder` in
heterogeneous environments.
+
+#### slow load balancing
+`UniformLoadShedder` will only unload the bundle from one of the highest
loaded brokers at a time,
+which may take a considerable amount of time for a large cluster to complete
all load balancing tasks.
+For example, if there are 100 high load brokers in the current cluster and 100
new machines to be added,
+it is roughly estimated that it will take 100 shedding to complete the
balancing.
+However, since the execution time interval of the `LoadSheddingStrategy`
policy is determined by the
+configuration of `loadBalancerSheddingIntervalMinutes`, which defaults to once
every 1 minute,
+so it will take 100 minutes to complete all tasks. For users using large
partition topics, their tasks
+are likely to be disconnected multiple times within this 100 minutes, which
greatly affects the user experience.
+
+
+## ModularLoadManagerStrategy
+The `LoadSheddingStrategy` strategy is used to unload bundles of high load
brokers. However, in order to
+achieve a good load balancing effect, it is necessary not only to "unload"
correctly, but also to "load" correctly.
+The `ModularLoadManagerStrategy` strategy is responsible for assigning bundles
to brokers.
+The coordination between `LoadSheddingStrategy` and
`ModularLoadManagerStrategy` is also a key point worth paying attention to.
+
+### LeastLongTermMessageRate
+The `LeastLongTermMessageRate` algorithm directly used the maximum resource
usage of CPU and so on as the broker's score,
+and reused the `OverloadShedder` configuration,
`loadBalancerBrokerOverloadedThresholdPercentage`.
+If the score is greater than it (default 85%), set `score=INF`; Otherwise,
update the broker's score to the sum of the
+message in and out rates obtained from the broker's long-term aggregation.
+```
+score = longTerm MsgIn rate+longTerm MsgOut rate,
+```
+Finally, randomly select a broker from the broker with the lowest score to
return. If the score of each broker is INF,
+randomly select broker from all brokers.
+
+The scoring algorithm in `LeastLongTermMessageRate` is essentially based on
message rate. Although it initially examines
+the maximum resource utilization, it is to exclude overloaded brokers only.
+Therefore, in most cases, brokers are sorted based on the size of the message
rate as a score, which results in the same
+issues with heterogeneous environments, similar to `UniformLoadShedder`.
+
+
+#### Effect of the combination of `LoadSheddingStrategy` and
`LeastLongTermMessageRate`
+Next, we will attempt to analyze the effect together with the
`LoadSheddingStrategy`.
+- **LeastLongTermMessageRate + OverloadShedder**
+This is the initial combination, but due to some inherent flaws in
`OverloadShedder`, **it is not recommended**.
+
+- **LeastLongTermMessageRate + ThresholdShedder**
+This combination is even worse than `LeastLongTermMessageRate +
OverloadShedder` and **is not recommended**.
+Because `OverloadShedder` uses the maximum weighted resource usage and
historical score to score brokers,
+while LeastLongTermMessage Rate is scored based on message rate. Inconsistent
unloading and placement criteria
+can lead to incorrect load balancing execution.
+This is also why a new placement strategy `LeastResourceUsageWithWeight` will
be introduced later.
+
+- **LeastLongTermMessageRate + UniformLoadShedder**
+This is **recommended**. Both uninstallation and placement policy are based on
message rate,
+but using message rate as a standard naturally leads to issues with
heterogeneous environments.
+
+
+### LeastResourceUsageWithWeight
+`LeastResourceUsageWithWeight` uses the same scoring algorithm as
`ThresholdShedder` to score brokers, which uses
+weighted maximum resource usage and historical scores to calculate the current
score.
+
+Next, select candidate brokers based on the configuration of
`loadBalancerAverageResourceUsageDifferenceThresholdPercentage`.
+If a broker's score plus this threshold is still not greater than the average
score, the broker will be added to the
+candidate broker list. After obtaining the candidate broker list, a broker
will be randomly selected from it;
+If there are no candidate brokers, randomly select from all brokers.
+
+For example, if the resource utilization rate of broker 1 is 10%, broker 2 is
30%, and broker 3 is 80%,
+the average resource utilization rate is 40%. The placement strategy can
choose Broker1 and Broker2
+as the best candidates, as the thresholds are 10, 10+10<=40, 30+10<=40. In
this way, the bundles uninstalled
+from broker 3 will be evenly distributed among broker 1 and broker 2, rather
than being completely placed on broker 1.
+
+#### over placement problem
+Over placement problem is that the bundle is placed on high load brokers and
make them overloaded.
+
+In practice, it will be found that it is difficult to determine a suitable
value for `loadBalancerAverageResourceUsageDifferenceThresholdPercentage`,
+which often triggers a fallback global random selection logic. For example, if
there are 6 brokers in the current
+cluster, with scores of 40, 40, 40, 40, 69, and 70 respectively, the average
score is 49.83.
+Using the default configuration, there are no candidate brokers because
40+10>49.83.
+Triggering a bottom-up global random selection logic and the bundle may be
offloaded from the overloaded broker5
+to the overloaded broker6, or vice versa, **causing the over placement
problem.**
+
+Attempting to reduce the configuration value to expand the random pool, such
as setting it to 0, may also include some
+overloaded brokers in the candidate broker list. For example, if there are 5
brokers in the current cluster with scores
+of 10, 60, 70, 80, and 80 respectively, the average score is 60. As the
configuration value is 0, then broker 1 and
+broker 2 are both candidate brokers. If broker 2 shares half of the offloaded
traffic, **it is highly likely to overload.**
+
+Therefore, it is difficult to configure the `LeastResourceUsageWithWeight`
algorithm well to avoid incorrect load balancing.
+Of course, if you want to use the `ThresholdShedder` algorithm, the
combination of `ThresholdShedder+LeastResourceUsageWithWeight`
+will still be superior to the combination of
`ThresholdShedder+LeastLongTermMessageRate`, because at least the scoring
algorithm
+of `LeastResourceUsageWithWeight` is consistent with that of
`ThresholdShedder`.
+
+#### why doesn't LeastLongTermMessage Rate have over placement problem?
+The root of over placement problem is that the frequency of updating the load
data is limited due to the performance
+of zookeeper. If we assign a bundle to a broker, the broker's load will
increase after a while, and it's load data
+also need some time to be updated to leader broker. If there are many bundles
unloaded in a shedding,
+how can we assign these bundles to brokers?
+
+The most simple way is to assign them to the broker with the lowest load, but
it may cause the over placement problem
+as it is most likely that there is only one single broker with the lowest
load. With all bundles assigned to this broker,
+it will be overloaded. This is the reason why `LeastResourceUsageWithWeight`
try to determine a candidate broker list
+to avoid the over placement problem. But we also find that candidate broker
list can be empty or include some overloaded
+brokers, which will also cause the over placement problem.
+
+So why doesn't `LeastLongTermMessageRate` have over placement problem? The
reason is that each time a bundle is assigned,
+the bundle will be added into `PreallocatedBundleData`. When scoring a broker,
not only will the long-term message rate
+aggregated by the broker itself be used, but also the message rate of bundles
in `PreallocatedBundleData` that have been
+assigned to the broker but have not yet been reflected in the broker's load
data will be calculated.
+
+For example, if there are two bundles with 20KB/s message rate to be assigned,
and broker1 and broker2 at 100KB/s
+and 110KB/s respectively. The first bundle is assigned to broker1, However,
broker1's load data will not be updated
+in the short term. Before the load data is updated, `LeastLongTermMessageRate`
try to assign the second bundle.
+At this time, the score of broker1 is 100+20=120KB/s, where 20KB/s is the
message rate of the first bundle
+from `PreallocatedBundleData`. As broker1's score is greater than broker2, the
second bundle will be assigned to broker2.
+
+**`LeastLongTermMessageRate` predict the load of the broker after the bundle
is assigned to avoid the over placement problem.**
+
+**Why doesn't `LeastResourceUsageWithWeight` have this feature? Because it is
not possible to predict how much resource
+utilization a broker will increase when loading a bundle. All algorithms
scoring brokers based on resource utilization
+can't fix the over placement problem with this feature.**
+So `LeastResourceUsageWithWeight` try to determine a candidate broker list to
avoid the over placement problem, which is
+proved to be not a good solution.
+
+
+#### over unloading problem
+Over unloading problem is that the load offloaded from high load brokers is
too much and make them underloaded.
+
+Finally, let's talk about the issue of historical weighted scoring algorithms.
The historical weighted scoring algorithm
+is used by the `ThresholdShedder` and `LeastResourceUsageWithWeight`
algorithms, as follows:
+```
+HistoryUsage=historyUsage=null? ResourceUsage: historyUsage *
historyPercentage+(1- historyPercentage) * resourceUsage;
+```
+The default value of historyPercentage is 0.9, indicating that the score
calculated last time has a significant impact on the current score.
+The current maximum resource utilization only accounts for 10%, which is to
solves the problem of load jitter.
+However, introducing this algorithm has its side effects, such as over
unloading problem.
+
+For example, there is currently one broker1 in the cluster with a load of 90%,
and broker2 is added with a current load of 10%.
+- At the first execution of shedding: broker1 scores 90, broker2 scores 10.
For simplicity, assuming that the algorithm will
+move some bundles to make their load the same, thus the true load of broker 1
and broker 2 become 50 after load shedding is completed.
+- At the second execution of shedding: broker1 scores 90*0.9+50*0.1=86,
broker2 scores 10*0.9+50*0.1=14.
+**Note that the actual load of broker1 here is 50, but it is overestimated as
86!**
+**The true load of broker2 is also 50, but it is underestimated at 14!**
+Due to the significant difference in ratings between the two, although their
actual loads are already the same,
+broker1 will continue to unload traffic corresponding to 36 points from
broker1 to broker2,
+resulting in broker1's actual load score becoming 14, broker2's actual load
score becoming 86.
+
+- At the third execution of shedding: broker1 scored 86*0.9+14*0.1=78.8,
broker2 scored 14*0.9+86*0.1=21.2.
+It is ridiculous that broker1 is still considered overloaded, and broker2 is
still considered underloaded.
+All loads in broker1 are moved to broker2, which is the over unloading problem.
+
+Although this example is an idealized theoretical analysis, we can still see
that using historical scoring algorithms
+can seriously overestimate or underestimate the true load of the broker.
Although it can avoid the problem of load jitter,
+it will introduce a more serious and broader problem: **overestimating or
underestimating the true load of the broker,
+leading to incorrect load balancing execution**.
+
+
+## Summary
+Based on the previous analysis, although we have three shedding strategies and
two placement strategies
+that can generate 6 combinations of 3 * 2, we actually only have two
recommended options:
+- ThresholdShedder + LeastResourceUsageWithWeight
+- UniformLoadShedder + LeastLongTermMessageRate
+
+These two options each have their own advantages and disadvantages, and users
can choose one according to
+their requirements. The following table summarizes the advantages and
disadvantages of the two options:
+
+| Combination | heterogeneous environment |
load jitter | over placement problem | over unloading problem | slow load
balancing |
+|---------------------------------------------|---------------------------|------------|-----------------------|-----------------------|---------------------|
+| ThresholdShedder + LeastResourceUsageWithWeight | normal(1)
| good | bad | bad | normal(1)
|
+| UniformLoadShedder + LeastLongTermMessageRate | bad(2) |
bad | good | good | normal(1)
|
+
+1. In terms of adapting to heterogeneous environments,
`ThresholdShedder+LeastResourceUsageWithWeight` can
+only be rated as `normal`. This is because `ThresholdShedder` is not fully
adaptable to heterogeneous environments.
+Although it does not misjudge overloaded brokers as underloaded, heterogeneous
environments can still have a
+significant impact on the load balancing effect of `ThresholdShedder`.
+For example, there are three brokers in the current cluster with resource
utilization rates of 10, 50, and 70, respectively.
+Broker1 and Broker2 are isomorphic. Though Broker3 don't bear any load, its
resource utilization rate has
+reached to 70 due to the deployment of other processes at the same machine.
+At this point, we would like broker 1 to share some of the pressure from
broker2, but since the average load is
+43.33, 43.33+10>50, broker2 will not be judged as overloaded, and overloaded
broker 3 also has no traffic to
+unload, causing the load balancing algorithm to be in an inoperable state.
+
+2. In the same scenario, if `UniformLoadShedder+LeastLongTermMessageRate` is
used, the problem will be more
+severe, as some of the load will be offloaded from broker2 to broker3. As a
result, the performance of those
+topics in broker3 services will experience significant performance degradation.
+Therefore, it is not recommended to run Pulsar in heterogeneous environments
as current load balancing algorithms
+cannot adapt too well. If it is unavoidable, it is recommended to choose
`ThresholdShedder+LeastResourceUsageWithWeight`.
+
+3. In terms of load balancing speed, although
`ThresholdShedder+LeastResourceUsageWithWeight` can unload the load
+of all overloaded brokers at once, historical scoring algorithms can seriously
affect the accuracy of load
+balancing decisions. Therefore, in reality, it also requires multiple load
balancing executions to finally
+stabilize. This is why the load balancing speed of
`ThresholdShedder+LeastResourceUsageWithWeight` is rated as `normal`.
+
+4. In terms of load balancing speed,
`UniformLoadShedder+LeastLongTermMessageRate` can only unload the load of one
+overloaded broker at a time, so it takes a long time to complete load
balancing when there are many brokers,
+so it is also rated as `normal`.
+
+
+# Motivation
+
+The current load balance algorithm has some serious problems, such as load
jitter, heterogeneous environment, slow load balancing, etc.
+This PIP aims to introduce a new load balance algorithm `AvgShedder` to solve
these problems.
+
+# Goals
+
+Introduce a new load balance algorithm `AvgShedder` that can solve the
problems of load jitter, heterogeneous environment, slow load balancing, etc.
+
+
+# High Level Design
+
+## scoring criterion
+First of all, to determine high load brokers, it is necessary to rate and sort
them.
+Currently, there are two scoring criteria:
+- Resource utilization rate of broker
+- The message rate and throughput of the broker
+Based on the previous analysis, it can be seen that scoring based on message
rate and throughput will face
+the same problem as `UniformLoadShedder` in heterogeneous environments, while
scoring based on resource utilization
+rate will face the over placement problem like `LeastResourceUsageWithWeight`.
+
+**To solve the problem of heterogeneous environments, we use the resource
utilization rate of the broker as the scoring criterion.**
+
+
+## binding shedding and placement strategies
+So how can we avoid the over placement problem? **The key is to bind the
shedding and placement strategies together.**
+If every bundle unloaded from the high load broker is assigned to the right
low load broker in shedding strategy,
+the over placement problem will be solved.
+
+For example, if the broker rating of the current cluster is 20,30,52,80,80,
and the shedding and placement strategies are decoupled,
+the bundles will be unloaded from the two brokers with score of 80, and then
all these bundles will be placed on the broker with a
+score of 20, causing the over placement problem.
+
+If the shedding and placement strategies are coupled, one broker with 80 score
can unload some bundles to a broker with 20 score,
+and another broker with 80 score can unload the bundle to the broker with 30
score. In this way, we can avoid the over placement problem.
+
+
+## evenly distributed traffic between the highest and lowest loaded brokers
+We will first pick out the highest and lowest loaded brokers, and then evenly
distribute the traffic between them.
+
+For example, if the broker rating of the current cluster is 20,30,52,70,80,
and the message rate of the highest loaded broker is 1000,
+the message rate of the lowest loaded broker is 500. We introduce a threshold
to whether trigger the bundle unload, for example,
+the threshold is 40. As the difference between the score of the highest and
lowest loaded brokers is 100-50=50>40,
+the shedding strategy will be triggered.
+
+To achieve the goal of evenly distributing the traffic between the highest and
lowest loaded brokers, the shedding strategy will
+try to make the message rate of two brokers the same, which is
(1000+500)/2=750. The shedding strategy will unload 250 message rate from the
+highest loaded broker to the lowest loaded broker. After the shedding strategy
is completed, the message rate of two brokers will be
+same, which is 750.
+
+
+## improve the load balancing speed
+As we mentioned earlier in `UniformLoadShedder`, if strategy only handles one
high load broker at a time, it will take a long time to
+complete all load balancing tasks. Therefore, we further optimize it by
matching multiple pairs of high and low load brokers in
+a single shedding. After sorting the broker scores, the first and last place
are paired, the second and and the second to last are paired,
+and so on. When the score difference between the two paired brokers is greater
than the threshold, the load will be evenly distributed
+between the two, which can solve the problem of slow speed.
+
+For example, if the broker rating of the current cluster is 20,30,52,70,80, we
will pair 20 and 80, 30 and 70. As the difference between
+the two paired brokers is 80-20=60, 70-30=40, which are both greater than the
threshold 40, the shedding strategy will be triggered.
+
+
+## handle load jitter with multiple hits threshold
+What about the historical weighting algorithm used in `ThresholdShedder`? It
is used to solve the problem of load jitter, but previous
+analysis and experiments have shown that it can bring serious negative
effects, so we can no longer use this method to solve the
+problem of load jitter.
+
+We mimic the way alarms are triggered: the threshold is triggered multiple
times before the bundle unload is finally triggered.
+For example, when the difference between a pair of brokers exceeds the
threshold three times, load balancing is triggered.
+
+## high and low threshold
+In situations of cluster rolling restart or expansion, there is often a
significant load difference between
+different brokers, and we hope to complete load balancing more quickly.
+
+Therefore, we introduce two thresholds:
+- loadBalancerAvgShedderLowThreshold, default value is 15
+- loadBalancerAvgShedderHighThreshold, default value is 40
+
+Two thresholds correspond to two continuous hit count requirements:
+- loadBalancerAvgShedderHitCountLowThreshold, default value is 8
+- loadBalancerAvgShedderHitCountHighThreshold, default value of 2
+
+When the difference in scores between two paired brokers exceeds the
`loadBalancerAvgShedderLowThreshold` by
+`loadBalancerAvgShedderHitCountLowThreshold` times, or exceeds the
`loadBalancerAvgShedderHighThreshold` by
+`loadBalancerAvgShedderHitCountHighThreshold` times, a bundle unload is
triggered.
+For example, with the default value, if the score difference exceeds 15, it
needs to be triggered 8 times continuously,
+and if the score difference exceeds 40, it needs to be triggered 2 times
continuously.
+
+The larger the load difference between brokers, the smaller the number of
times it takes to trigger bundle unloads,
+which can adapt to scenarios such as cluster rolling restart or expansion.
+
+## placement strategy
+As mentioned earlier, `AvgShedder` bundles the shedding and placement
strategies, and a bundle has already determined
+its next owner broker based on the shedding strategy during shedding. But we
not only use placement strategies after
+executing shedding, but also need to use placement strategies to assign
bundles during cluster initialization, rolling
+restart, and broker shutdown. So how should we assign these bundles without
shedding strategies?
+
+We use a hash allocation method: hash mapping a random number to broker. Hash
mapping roughly conforms to
+a uniform distribution, so bundles will be roughly evenly distributed across
all brokers. However, due to the different
+throughput between different bundles, the cluster will exhibit a certain
degree of imbalance. However, this problem is
+not significant, and the subsequent balancing can be achieved through shedding
strategies. Moreover, the frequency of
+cluster initialization, rolling restart, and broker shutdown scenarios is not
high, so the impact is slight.
+
+## summary
+In summary, `AvgShedder` can solve the problems of load jitter, heterogeneous
environment, slow load balancing, etc.
+Following table summarizes the advantages and disadvantages of the three
options:
+
+| Combination | heterogeneous environment |
load jitter | over placement problem | over unloading problem | slow load
balancing |
+|---------------------------------------------|------------------------|------------|-----------------------|-----------------------|--------------|
+| ThresholdShedder + LeastResourceUsageWithWeight | normal |
good | bad | bad | normal |
+| UniformLoadShedder + LeastLongTermMessageRate | bad | bad
| good | good | normal |
+| AvgShedder | normal | good
| good | good | good |
+
+
+# Detailed Design
+
+### Configuration
+
+To avoid introducing too many configurations when calculating how much traffic
needs to be unloaded, `AvgShedder` reuses the
+following three `UniformLoadShedder` configurations:
+```
+ @FieldContext(
+ dynamic = true,
+ category = CATEGORY_LOAD_BALANCER,
+ doc = "In the UniformLoadShedder strategy, the minimum message
that triggers unload."
+ )
+ private int minUnloadMessage = 1000;
+
+ @FieldContext(
+ dynamic = true,
+ category = CATEGORY_LOAD_BALANCER,
+ doc = "In the UniformLoadShedder strategy, the minimum throughput
that triggers unload."
+ )
+ private int minUnloadMessageThroughput = 1 * 1024 * 1024;
+
+ @FieldContext(
+ dynamic = true,
+ category = CATEGORY_LOAD_BALANCER,
+ doc = "In the UniformLoadShedder strategy, the maximum unload
ratio."
+ )
+ private double maxUnloadPercentage = 0.2;
+```
+
+The `maxUnloadPercentage` controls the allocation ratio. Although the default
value is 0.2, our goal is to evenly distribute the
+pressure between two brokers. Therefore, we set the value to 0.5, so that
after load balancing is completed, the message rate/throughput
+of the two brokers will be almost equal.
+
+The following configurations are introduced to control the shedding strategy:
+```
+ @FieldContext(
+ dynamic = true,
+ category = CATEGORY_LOAD_BALANCER,
+ doc = "The low threshold for the difference between the highest
and lowest loaded brokers."
+ )
+ private int loadBalancerAvgShedderLowThreshold = 15;
+
+ @FieldContext(
+ dynamic = true,
+ category = CATEGORY_LOAD_BALANCER,
+ doc = "The high threshold for the difference between the highest
and lowest loaded brokers."
+ )
+ private int loadBalancerAvgShedderHighThreshold = 40;
+
+ @FieldContext(
+ dynamic = true,
+ category = CATEGORY_LOAD_BALANCER,
+ doc = "The number of times the low threshold is triggered before
the bundle is unloaded."
+ )
+ private int loadBalancerAvgShedderHitCountLowThreshold = 8;
+
+ @FieldContext(
+ dynamic = true,
+ category = CATEGORY_LOAD_BALANCER,
+ doc = "The number of times the high threshold is triggered before
the bundle is unloaded."
+ )
+ private int loadBalancerAvgShedderHitCountHighThreshold = 2;
+```
+
+
+
+# Backward & Forward Compatibility
+
+Fully compatible.
+
+# General Notes
+
+# Links
+
+* Mailing List discussion thread:
https://lists.apache.org/thread/cy39b6jp38n38zyzd3bbw8b9vm5fwf3f
+* Mailing List voting thread:
https://lists.apache.org/thread/2v9fw5t5m5hlmjkrvjz6ywxjcqpmd02q