315157973 opened a new issue, #19806:
URL: https://github.com/apache/pulsar/issues/19806

   ### Motivation
   
   With all of the existing Pulsar load-balancing algorithms, it is difficult to keep the load of cluster nodes balanced. It often happens that some nodes are highly loaded while others are idle, and CPU usage varies widely across Broker nodes.
   
![image](https://user-images.githubusercontent.com/9758905/224992293-8074379f-eaca-4052-a563-4f0a95eda121.png)
   There are three reasons why the existing approach leaves the cluster unbalanced:
   1. The load managed by each Bundle is not even. Even if each bundle manages the same number of partitions, there is no guarantee that the sum of the loads of those partitions will be the same.
   2. Load is not shed effectively. The existing default policy, ThresholdShedder, has a relatively high usage threshold, and various traffic thresholds need to be set. Many clusters with high TPS and small message bodies may have high CPU but low traffic, and for many small-scale clusters the thresholds need to be adjusted to match the actual workload.
   3. Unloaded Bundles cannot be well distributed to other Brokers. The load information of each Broker is reported at regular intervals, so the Leader Broker's view when allocating Bundles cannot be guaranteed to be completely up to date. Moreover, if a large number of Bundles need to be redistributed while the load information is stale, the Leader may turn a low-load Broker into a new high-load node.
   
   ### Goal
   
   We reimplemented the algorithm in 3 parts and list our test report at the end:
   1. Assigning partitions to Bundles.
   2. Allocating Bundles to Brokers.
   3. The algorithm by which the leader decides which Bundles on a high-load Broker to offload.
   
   In this PIP, we only discuss the first part: `Assigning partitions to Bundles`, so that each bundle carries as equal a load as possible.
   
   ### API Changes
   
   1. Add a configuration item `partitionAssignerClassName` so that different partition assignment algorithms can be configured dynamically. The existing algorithm remains the default: `partitionAssignerClassName=ConsistentHashingPartitionAssigner`.
   2. Implement a new partition assignment class, `RoundRobinPartitionAssigner` (a sketch of the pluggable contract follows this list).
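
   To make the shape of the change concrete, below is a minimal sketch of what the pluggable contract could look like. The interface name, method signature, and parameter types are assumptions for illustration only, not the actual Pulsar API; the broker would reflectively load whatever class `partitionAssignerClassName` points to.

```java
/**
 * Hypothetical contract for a pluggable partition-to-bundle assigner.
 * Both ConsistentHashingPartitionAssigner and RoundRobinPartitionAssigner
 * would implement it; the names and signature here are illustrative only.
 */
public interface PartitionAssigner {

    /**
     * @param parentTopic  parent topic name, e.g. "persistent://tenant/ns/topic"
     * @param partitionIdx index of the partition within the parent topic
     * @param bundleCount  number of bundles currently defined for the namespace
     * @return index of the bundle that should own this partition
     */
    int selectBundle(String parentTopic, int partitionIdx, int bundleCount);
}
```

   An operator would then switch to the new behavior by setting `partitionAssignerClassName=RoundRobinPartitionAssigner` in the broker configuration.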
   
   ### Implementation
   
   By default, a client publishing to a multi-partition topic routes messages across the partitions in a round-robin fashion.
   Therefore, we consider the load of the partitions of the same topic to be balanced.
   We assign the partitions of the same topic to bundles in a round-robin manner.
   In this way, the difference in the number of partitions carried by any two bundles will not exceed 1.
   Since we consider the load of each partition of the same topic to be balanced, the load carried by each bundle is also balanced.
   <img src="https://user-images.githubusercontent.com/9758905/224997861-9bb9e1ef-1439-4f67-a3f7-53d9b6e1e5da.png" width="500" height="500">
   Operation steps:
   1. Partition 0 finds a starting bundle through the consistent-hash algorithm; assuming it is bundle0, we start from this bundle.
   2. By round-robin, partition 1 is assigned to the next bundle (bundle1), partition 2 to the next bundle (bundle2), and so on (see the sketch after these steps).
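
   As a concrete illustration of the two steps above, here is a minimal, self-contained sketch. The class name and the use of a CRC32 hash as a stand-in for the consistent-hash lookup of the starting bundle are assumptions for illustration, not the proposed implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Illustrative round-robin partition-to-bundle assignment (not the actual implementation). */
public class RoundRobinAssignmentSketch {

    /** Returns the index of the bundle that the given partition lands on. */
    public int selectBundle(String parentTopic, int partitionIdx, int bundleCount) {
        // Step 1: the parent topic determines a starting bundle. A CRC32 hash of the
        // topic name stands in here for the consistent-hash lookup used for partition 0.
        CRC32 crc = new CRC32();
        crc.update(parentTopic.getBytes(StandardCharsets.UTF_8));
        int start = (int) (crc.getValue() % bundleCount);

        // Step 2: subsequent partitions advance round-robin from the starting bundle,
        // so the number of partitions per bundle for one topic differs by at most 1.
        return (start + partitionIdx) % bundleCount;
    }

    public static void main(String[] args) {
        RoundRobinAssignmentSketch assigner = new RoundRobinAssignmentSketch();
        for (int p = 0; p < 6; p++) {
            System.out.printf("partition %d -> bundle %d%n",
                    p, assigner.selectBundle("persistent://tenant/ns/topic", p, 4));
        }
    }
}
```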
   
   If the number of partitions is less than the number of bundles, will some 
bundles have a high load?
   Since the starting bundle is determined by consistent hashing, the starting 
point of each topic is different, which can prevent the earlier bundles from 
becoming hotspots.
   
   When the number of bundles changes, will all partitions be reassigned?
   Only when the number of bundles changes will all partitions under the same namespace be reassigned.
   Changing the number of brokers or partitions will not trigger reassignment.
   We usually only split a bundle when it becomes a hotspot.
   The proposed assignment method keeps the load of each bundle approximately balanced, so a bundle split will not be triggered unless it is done manually.
   Of course, we have also measured how long assigning all partitions in an entire namespace takes in the worst case.
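   In terms of the illustrative sketch above, this behavior follows directly: if `bundleCount` changes, both the starting bundle and the modulo change, so every partition in the namespace maps to a new bundle, whereas adding partitions or brokers leaves the existing partition-to-bundle mapping untouched.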
   
   Test scenario: 6 * 4C32GB Brokers, with the CPU usage of each broker at about 60%, and 50,000 partitions under the namespace assigned at the same time. It takes about 30s.
   
![image](https://user-images.githubusercontent.com/9758905/225000884-224fb9d0-524a-4bf2-8d61-c78222218b8a.png)
   
   
   
   
   ### Security Considerations
   
   None
   
   ### Alternatives
   
   _No response_
   
   ### Anything else?
   
   Test report
   We tested several scenarios, each using the three algorithms mentioned above, and every node stayed well balanced.
   Machine: 4C32GB * 6
   Test scenario: 200 partitions; after the cluster is stable, restart one of the brokers
   
![image](https://user-images.githubusercontent.com/9758905/225001614-33f8321d-bdf2-4085-87bb-0ca8a89f70ac.png)
   
   
   Test scenario: Restart multiple nodes in a loop, and observe the final 
cluster load
   
![image](https://user-images.githubusercontent.com/9758905/225001846-39a61424-794b-4a79-80a9-d533af53d389.png)
   
   Test scenario: Add a new Broker
   <img width="757" alt="QQ20230314-204213@2x" 
src="https://user-images.githubusercontent.com/9758905/225004459-25f282da-38ef-4af2-be00-b55616d8a57c.png";>
   
   Test scenario: single-partition topics, where even unloading the bundle can make the receiving broker a new hotspot; observe whether the algorithm unloads frequently
   
![image](https://user-images.githubusercontent.com/9758905/225005595-785b168d-d314-425c-b1ba-a80f4840ff60.png)
   
   

