thetumbled opened a new issue, #18173:
URL: https://github.com/apache/pulsar/issues/18173

   ### Motivation
   
   We usually need to run shedding many times until load balancing is 
complete.
   
   ### Incorrect shedding due to the `historical scoring algorithm`  
     ThresholdShedder applies a historical scoring algorithm when calculating 
the score of every broker, in order to cope with performance fluctuations:  
   ```
   historyScore = historyScore == null ? currentScore : historyScore * 
   historyPercentage + (1 - historyPercentage) * currentScore;
   ```
   But this algorithm causes bundles to be shed incorrectly, so shedding has to 
run many times before a stable state is reached.  
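
   The update rule above is an exponential moving average. A minimal Java 
sketch of it, assuming a standalone helper (the class `HistoryScoreSketch` and 
the method `update` are hypothetical names used for illustration, not Pulsar's 
actual code):

```java
final class HistoryScoreSketch {
    /**
     * Blends the previous smoothed score with the latest sample.
     * A null previous score means the broker has no history yet,
     * so the current score is used as-is.
     */
    static double update(Double previous, double current, double historyPercentage) {
        if (previous == null) {
            return current;
        }
        return previous * historyPercentage + (1 - historyPercentage) * current;
    }
}
```

   With `historyPercentage = 0.9`, a new sample contributes only 10% to the 
score, which is why the score lags far behind the real load after a large 
change.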
   For example, say there is only one broker, `broker1`, in the cluster, with 
a CPU usage of 90%. To relieve this single broker, we add a new broker, 
`broker2`, whose initial CPU usage is 10%. So the score of `broker1` is 90 
and the score of `broker2` is 10.
   Assume that the shedding algorithm moves bundles so that the load of the 
two brokers becomes equal.  
   - In the first round of load balancing, `broker1` sheds bundles 
corresponding to 40% CPU usage to `broker2`, and the CPU usage of both 
brokers becomes 50%, which is good enough.   
   - But in the second round of shedding checks, the score of `broker1` is 
`0.9*90+0.1*50=86` (the default historyPercentage is 0.9), and the score of 
`broker2` is `10*0.9+50*0.1=14`. With `avg=(86+14)/2=50` we get `86>avg+10` 
(the default value of loadBalancerBrokerThresholdShedderPercentage is 10), 
**so the algorithm concludes that the load of the two brokers is unequal**. 
`broker1` then sheds bundles corresponding to `(86-14)/2=36` CPU usage to 
`broker2`; the CPU usage of `broker1` becomes `50-36=14` and the CPU usage 
of `broker2` becomes `50+36=86`.
   - In the third round of shedding checks, the score of `broker1` is 
`86*0.9+14*0.1=78.8`, the score of `broker2` is `14*0.9+86*0.1=21.2`, and 
with `avg=(78.8+21.2)/2=50` we get `78.8>avg+10`, **so the algorithm concludes 
that broker1 needs to shed bundles to broker2 again**. `broker1` is asked to 
shed bundles corresponding to `(78.8-21.2)/2=28.8` CPU usage to `broker2`. 
Since `broker1` is only at 14% by now, no bundles are left on `broker1` at 
all! 
   - ......
   - After many rounds of load balancing, we finally reach a stable state.
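
   The rounds above can be replayed with a small simulation. This is only a 
sketch under the simplifying assumptions of the example: two brokers, shedding 
only from `broker1` to `broker2`, the moved load equal to half of the score 
gap, and capped at what `broker1` actually carries. The class 
`SheddingSimulation` and its methods are hypothetical and do not exist in 
Pulsar.

```java
final class SheddingSimulation {
    static final double HISTORY_PERCENTAGE = 0.9; // default quoted in the issue
    static final double THRESHOLD = 10.0;         // default shedder percentage quoted in the issue

    /** Same update rule as the historyScore formula quoted earlier. */
    static double smooth(Double previous, double current) {
        return previous == null
                ? current
                : previous * HISTORY_PERCENTAGE + (1 - HISTORY_PERCENTAGE) * current;
    }

    /**
     * Replays `rounds` shedding checks for two brokers.
     * Each row is {score1, score2, usage1AfterShedding, usage2AfterShedding}.
     */
    static double[][] run(double usage1, double usage2, int rounds) {
        Double score1 = null;
        Double score2 = null;
        double[][] trace = new double[rounds][];
        for (int i = 0; i < rounds; i++) {
            score1 = smooth(score1, usage1);
            score2 = smooth(score2, usage2);
            double avg = (score1 + score2) / 2;
            if (score1 > avg + THRESHOLD) {
                // Move half of the score gap, but never more than broker1 actually carries.
                double moved = Math.min((score1 - score2) / 2, usage1);
                usage1 -= moved;
                usage2 += moved;
            }
            trace[i] = new double[] {score1, score2, usage1, usage2};
        }
        return trace;
    }

    public static void main(String[] args) {
        for (double[] row : run(90, 10, 3)) {
            System.out.printf("scores %.1f / %.1f -> usage %.1f / %.1f%n",
                    row[0], row[1], row[2], row[3]);
        }
    }
}
```

   Up to floating-point rounding, `SheddingSimulation.run(90, 10, 3)` gives 
scores of 90/10, 86/14 and 78.8/21.2 in rounds one to three, and the real 
usage ends at 0/100 even though it was already balanced at 50/50 after the 
first round.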
   
   In fact, a single round of shedding is all we need. But because of the 
`historical scoring algorithm`, the algorithm wrongly concludes that the load 
of the two brokers is uneven. 
   The downside of the `historical scoring algorithm` is so large that we 
propose a new way to handle performance fluctuations and disable the 
`historical scoring algorithm`. It is introduced in the section `Multi Hit 
Algorithm` below.
   
   
   
   ### Goal
   
   Disable the `historical scoring algorithm` and introduce a new algorithm 
to handle performance fluctuations. 
   
   ### API Changes
   
   _No response_
   
   ### Implementation
   
   #### Multi Hit Algorithm
   Performance fluctuations are common. For example, the CPU usage could 
suddenly increase by 20% and then fall back soon afterwards.  
   So we do not shed bundles as soon as a broker is judged to be 
overloaded; instead we count the number of **consecutive hits** of each broker. 
When the hit count of any broker is greater than the configured 
`HitCountThreshold`, we do the shedding and reset the hit count to 0.  
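
   A minimal sketch of such a per-broker counter follows. The class 
`HitCounterSketch` and the method `record` are hypothetical names, not part of 
Pulsar. The sketch also assumes that shedding fires as soon as the count 
reaches the threshold (`>=`), so that a threshold of 1 fires on the first 
check; a strict "greater than" reading would fire one check later.

```java
import java.util.HashMap;
import java.util.Map;

final class HitCounterSketch {
    private final int hitCountThreshold;
    private final Map<String, Integer> consecutiveHits = new HashMap<>();

    HitCounterSketch(int hitCountThreshold) {
        this.hitCountThreshold = hitCountThreshold;
    }

    /**
     * Records the result of one shedding check for a broker.
     * Returns true when shedding should be performed in this round.
     */
    boolean record(String broker, boolean overloaded) {
        if (!overloaded) {
            consecutiveHits.remove(broker); // the streak is broken
            return false;
        }
        int hits = consecutiveHits.merge(broker, 1, Integer::sum);
        if (hits >= hitCountThreshold) {
            consecutiveHits.put(broker, 0); // reset after shedding
            return true;
        }
        return false;
    }
}
```

   Any round in which the broker is not overloaded clears its streak, so a 
short spike never accumulates enough hits to trigger shedding.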
   By default, shedding is checked once per minute. Say we set 
`HitCountThreshold` to 5; then we can ride out performance fluctuations 
lasting up to 5 minutes.  
   But this prolongs the wait for load balancing when new brokers are added: 
we have to wait 5 minutes before load balancing is triggered.   
   The solution is to use two kinds of `HitCountThreshold`, 
`HitCountThresholdForHigh` and `HitCountThresholdForLow`, and two kinds of 
`loadBalancerBrokerThresholdShedderPercentage`, `PercentageForHigh` and 
`PercentageForLow`.
   If the CPU usage of any broker exceeds the average CPU usage by 
`PercentageForHigh` for `HitCountThresholdForHigh` consecutive checks, we do 
the shedding; similarly, if it exceeds the average CPU usage by 
`PercentageForLow` for `HitCountThresholdForLow` consecutive checks, we also 
do the shedding.  
   We can set `HitCountThresholdForHigh` to 1, `PercentageForHigh` to 40, 
`PercentageForLow` to 10 and `HitCountThresholdForLow` to 5. Load balancing 
is then triggered in the first round of shedding after a new broker joins, 
because the CPU usage of a new broker is usually far below the average.
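
   The two-tier rule can be sketched for a single broker as follows. The 
class `TwoTierHitSketch` and its members are hypothetical; the four 
constructor arguments stand for the proposed settings, and the comparisons 
(strictly above the margin, fire when the count reaches the threshold) are 
assumptions of this sketch rather than a confirmed design.

```java
final class TwoTierHitSketch {
    private final double percentageForHigh;
    private final int hitCountThresholdForHigh;
    private final double percentageForLow;
    private final int hitCountThresholdForLow;
    private int highHits;
    private int lowHits;

    TwoTierHitSketch(double percentageForHigh, int hitCountThresholdForHigh,
                     double percentageForLow, int hitCountThresholdForLow) {
        this.percentageForHigh = percentageForHigh;
        this.hitCountThresholdForHigh = hitCountThresholdForHigh;
        this.percentageForLow = percentageForLow;
        this.hitCountThresholdForLow = hitCountThresholdForLow;
    }

    /** Returns true when this broker should shed bundles in the current round. */
    boolean check(double usage, double avgUsage) {
        double excess = usage - avgUsage;
        // Each tier keeps its own streak; a miss resets that tier's streak.
        highHits = excess > percentageForHigh ? highHits + 1 : 0;
        lowHits = excess > percentageForLow ? lowHits + 1 : 0;
        if (highHits >= hitCountThresholdForHigh || lowHits >= hitCountThresholdForLow) {
            highHits = 0;
            lowHits = 0;
            return true;
        }
        return false;
    }
}
```

   With the values suggested above, `new TwoTierHitSketch(40, 1, 10, 5)` 
sheds immediately for a broker that is more than 40 points above the average, 
while a broker that is only 15 points above it must stay there for five 
consecutive checks.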
   
   ### Alternatives
   
   _No response_
   
   ### Anything else?
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
