thetumbled opened a new issue, #18173:
URL: https://github.com/apache/pulsar/issues/18173
### Motivation
We usually need to shed bundles many times until load balancing is
complete.
### Incorrect shedding due to the `historical scoring algorithm`
ThresholdShedder implements a historical scoring algorithm when calculating
the score of each broker, in order to smooth out performance fluctuations:
```
historyScore = historyScore == null ? currentScore
        : historyScore * historyPercentage + (1 - historyPercentage) * currentScore;
```
But this algorithm causes bundles to be shed incorrectly, so shedding has to
run many times before the cluster reaches a stable state.
For example, say that there is only one broker `broker1` in the cluster,
with a CPU usage of 90%. To reduce the burden on this single broker, we add a
new broker `broker2` to the cluster with an initial CPU usage of 10%. So the
score of `broker1` is 90, and the score of `broker2` is 10.
Assume that the shedding algorithm sheds bundles so as to make the load of
these two brokers equal.
- In the first round of load balancing, `broker1` sheds bundles
corresponding to 40% CPU usage to `broker2`, so both brokers end up at 50% CPU
usage, which is good enough.
- But in the second round of shedding checks, the score of `broker1` is
`0.9*90+0.1*50=86` (the default historyPercentage is 0.9), and the score of
`broker2` is `10*0.9+50*0.1=14`. With `avg=(86+14)/2=50`, we get `86>avg+10`
(the default value of loadBalancerBrokerThresholdShedderPercentage is 10), **so
the algorithm concludes that the load of these two brokers is unequal**. Then
`broker1` sheds bundles corresponding to `(86-14)/2=36` CPU usage to
`broker2`; the CPU usage of `broker1` becomes `50-36=14` and the CPU usage of
`broker2` becomes `50+36=86`.
- In the third round of shedding checks, the score of `broker1` is
`86*0.9+14*0.1=78.8`, and the score of `broker2` is `14*0.9+86*0.1=21.2`, so
`avg=(78.8+21.2)/2=50` and `78.8>avg+10`. **So the algorithm concludes that
broker1 needs to shed bundles to broker2 again**. Then `broker1` tries to shed
bundles corresponding to `(78.8-21.2)/2=28.8` CPU usage to `broker2`, which is
more than the 14% it still carries, so no bundles are left on `broker1` at all!
- ......
- After many rounds of load balancing, we finally reach a stable state.
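
The rounds above can be replayed with a small simulation. This is only a
sketch under stated assumptions, not Pulsar code: the class
`HistoryScoreSimulation` and its methods are hypothetical, the score is taken
to be the raw CPU percentage, there are exactly two brokers, and the overloaded
broker moves half of the score gap, capped by the load it actually carries.

```java
// Hypothetical sketch: replays the two-broker example with the smoothed score.
final class HistoryScoreSimulation {
    static final double HISTORY_PERCENTAGE = 0.9; // default quoted in the issue
    static final double THRESHOLD = 10.0;         // default shedder percentage quoted in the issue

    // The historical scoring formula from the issue.
    static double smooth(Double historyScore, double currentScore) {
        return historyScore == null
                ? currentScore
                : historyScore * HISTORY_PERCENTAGE + (1 - HISTORY_PERCENTAGE) * currentScore;
    }

    // Returns {load1, load2} (real CPU usage) after the given number of rounds.
    static double[] loadsAfter(double load1, double load2, int rounds) {
        Double score1 = null;
        Double score2 = null;
        for (int i = 0; i < rounds; i++) {
            score1 = smooth(score1, load1);
            score2 = smooth(score2, load2);
            double avg = (score1 + score2) / 2;
            double half = Math.abs(score1 - score2) / 2;
            if (score1 > avg + THRESHOLD) {
                // Assumption: a broker cannot shed more load than it carries.
                double moved = Math.min(half, load1);
                load1 -= moved;
                load2 += moved;
            } else if (score2 > avg + THRESHOLD) {
                double moved = Math.min(half, load2);
                load2 -= moved;
                load1 += moved;
            }
        }
        return new double[] {load1, load2};
    }

    public static void main(String[] args) {
        for (int rounds = 1; rounds <= 3; rounds++) {
            double[] loads = loadsAfter(90, 10, rounds);
            System.out.println(String.format("after round %d: broker1=%.1f broker2=%.1f",
                    rounds, loads[0], loads[1]));
        }
    }
}
```

The real load is already balanced after the first round, but the smoothed
scores keep lagging behind it, which is what triggers the later, unnecessary
rounds.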
In fact, only one round of shedding is needed. But the algorithm wrongly
concludes that the load between these two brokers is uneven, because of the
`historical scoring algorithm`.
The downside of the `historical scoring algorithm` is so large that we
propose a new way to handle performance fluctuations and disable the
`historical scoring algorithm`. It is introduced in the section `Multi Hit
Algorithm` below.
### Goal
Disable the `historical scoring algorithm`, and introduce a new algorithm
to handle performance fluctuations.
### API Changes
_No response_
### Implementation
## Multi Hit Algorithm
Performance fluctuations are common. For example, the CPU usage could
suddenly increase by 20% and then fall back soon afterwards.
So we do not shed bundles as soon as a broker is judged to be overloaded;
instead we count the number of **consecutive hits** of each broker.
When the hit count of any broker is greater than the configured
`HitCountThreshold`, we do the shedding and reset the hit count to 0.
By default, shedding runs once per minute. Say that we set
`HitCountThreshold` to 5; then we can tolerate performance fluctuations
lasting for 5 minutes.
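
The counting described above could look roughly like the following sketch.
The class `ConsecutiveHitCounter` and its method are hypothetical and not part
of Pulsar. One assumption is worth calling out: the text says "greater than",
while the later example with `HitCountThresholdForHigh` set to 1 expects
shedding in the first round, so this sketch triggers once the count *reaches*
the threshold.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-broker consecutive hit counting.
final class ConsecutiveHitCounter {
    private final int hitCountThreshold;
    private final Map<String, Integer> hits = new HashMap<>();

    ConsecutiveHitCounter(int hitCountThreshold) {
        this.hitCountThreshold = hitCountThreshold;
    }

    // Records the result of one shedding check for a broker.
    // Returns true when shedding should actually be performed.
    boolean record(String broker, boolean overloaded) {
        if (!overloaded) {
            // A single non-overloaded check breaks the streak.
            hits.remove(broker);
            return false;
        }
        int count = hits.merge(broker, 1, Integer::sum);
        // Assumption: trigger when the count reaches the threshold.
        if (count >= hitCountThreshold) {
            hits.remove(broker); // reset the hit count to 0
            return true;
        }
        return false;
    }
}
```

With a threshold of 5 and one check per minute, a spike that disappears
before the fifth consecutive check never leads to shedding.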
But this prolongs the waiting time for load balancing when new brokers are
added: we would have to wait 5 minutes before load balancing is triggered.
The solution is to use two kinds of `HitCountThreshold`,
`HitCountThresholdForHigh` and `HitCountThresholdForLow`, together with two
kinds of `loadBalancerBrokerThresholdShedderPercentage`, `PercentageForHigh`
and `PercentageForLow`.
If the CPU usage of any broker exceeds the average CPU usage by more than
`PercentageForHigh` for `HitCountThresholdForHigh` consecutive checks, we do
the shedding; similarly, if the CPU usage of any broker exceeds the average CPU
usage by more than `PercentageForLow` for `HitCountThresholdForLow`
consecutive checks, we also do the shedding.
We can set `HitCountThresholdForHigh` to 1, `PercentageForHigh` to 40,
`PercentageForLow` to 10, and `HitCountThresholdForLow` to 5, so that load
balancing can be triggered in the first round of shedding, because the CPU
usage of a new broker is usually much lower than the average.
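
A minimal sketch of the two-level check for a single broker follows. The
class `TwoLevelHitPolicy` is hypothetical; it assumes that "exceeds by a
percentage" means `usage > avg + percentage`, as in the `86>avg+10`
comparison earlier, and that shedding triggers once a hit count reaches its
threshold.

```java
// Hypothetical sketch: two thresholds, each with its own consecutive hit count.
final class TwoLevelHitPolicy {
    private final double percentageForHigh;
    private final int hitCountThresholdForHigh;
    private final double percentageForLow;
    private final int hitCountThresholdForLow;
    private int highHits;
    private int lowHits;

    TwoLevelHitPolicy(double percentageForHigh, int hitCountThresholdForHigh,
                      double percentageForLow, int hitCountThresholdForLow) {
        this.percentageForHigh = percentageForHigh;
        this.hitCountThresholdForHigh = hitCountThresholdForHigh;
        this.percentageForLow = percentageForLow;
        this.hitCountThresholdForLow = hitCountThresholdForLow;
    }

    // One shedding check for one broker; returns true when shedding should run.
    boolean check(double usage, double avgUsage) {
        // Each counter tracks consecutive hits and resets on a miss.
        highHits = usage > avgUsage + percentageForHigh ? highHits + 1 : 0;
        lowHits = usage > avgUsage + percentageForLow ? lowHits + 1 : 0;
        if (highHits >= hitCountThresholdForHigh || lowHits >= hitCountThresholdForLow) {
            highHits = 0;
            lowHits = 0;
            return true;
        }
        return false;
    }
}
```

With the values suggested above, `new TwoLevelHitPolicy(40, 1, 10, 5)` sheds
immediately when a broker is far above the average, and only after five
consecutive checks when it is moderately above it.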
### Alternatives
_No response_
### Anything else?
_No response_