rmdmattingly opened a new pull request, #6593: URL: https://github.com/apache/hbase/pull/6593
See my design doc [here](https://docs.google.com/document/d/1jA8Ghs86v7b-53j5DcsdbPnOXxbHjewkIBFi1E4S1pY/edit?usp=sharing).

To sum it up, the current load balancer isn't great at what it's supposed to do now, and it won't support everything we'd like it to do in a perfect world.

Right now: primary replica balancing squashes all other considerations. The default weight for one of the several cost functions that factor into primary replica balancing is 100,000. Meanwhile, the default read request cost weight is 5. The result is that the load balancer, out of the box, basically doesn't care about balancing actual load. To solve this, you can either set the primary replica balancing costs to zero, which is fine if you don't use read replicas, or, if you do use read replicas, try to produce a magic incantation of configurations that works _just_ right, until your needs change.

In the future: we'd like a lot more out of the balancer. System table isolation, meta table isolation, and colocation of regions based on start key prefix similarity (this is a very rough idea atm, and not touched in the scope of this PR). Supporting all of these features with either cost functions or RS groups would be a real burden. I think what I'm proposing here will be a much, much easier path for HBase operators.

## New features

This PR introduces some new features:

1. Balancer conditional based replica distribution
2. System table isolation (put backups, quotas, etc. on their own RegionServer, with all system tables sharing one)
3. Meta table isolation (put meta on its own RegionServer)

These can be controlled via:

- `hbase.master.balancer.stochastic.conditionals.distributeReplicas`: set this to true to enable conditional based replica distribution
- `hbase.master.balancer.stochastic.conditionals.isolateSystemTables`: set this to true to enable system table isolation
- `hbase.master.balancer.stochastic.conditionals.isolateMetaTable`: set this to true to enable meta table isolation
- `hbase.master.balancer.stochastic.additionalConditionals`: much like cost functions, you can define your own RegionPlanConditional implementations and install them here

## Testing

I wrote a lot of unit tests to validate the functionality here, both lightweight tests and some minicluster tests. Even in the most extreme cases (like system table isolation + meta table isolation enabled on a 3 node cluster, or the number of read replicas == the number of servers) the balancer does what we'd expect.

### Replica Distribution Improvements

#### Perfect primary and secondary replica distribution

Not only does this PR offer an alternative means of distributing replicas, but it's actually a massive improvement on the existing approach. See [the Replica Distribution testing section of my design doc](https://docs.google.com/document/d/1jA8Ghs86v7b-53j5DcsdbPnOXxbHjewkIBFi1E4S1pY/edit?tab=t.0). Cost functions never successfully balance 3 replicas across 3 servers out of the box, but balancer conditionals do so expeditiously.

To summarize the testing, we have `replicated_table`, a table with 3 region replicas. The 3 regions of a given replica share a color, and there are also 3 RegionServers in the cluster. We expect the balancer to evenly distribute one replica per server across the 3 RegionServers...

**Cost functions don't work**:

**….omitting the meaningless snapshots between 4 and 27…**

At this point, I just exited the test because it was clear that our existing balancer would never achieve true replica distribution.

But **balancer conditionals do work**:

#### Replica distribution performance improvements

I've set up a large cluster test for conditional replica balancing, at an identical scale to the existing large cluster test for legacy replica balancing. It demonstrates a _significant_ improvement in balancer latency when dealing with 1k servers, 20k regions, 3 replicas per region, and 100 tables:

<img width="515" alt="Screenshot 2025-01-04 at 11 48 55 AM" src="https://github.com/user-attachments/assets/15049da7-1e24-46af-971d-c22d2e07b8c5" />

### New Features: Table Isolation Working as Designed

See below where we ran a new unit test, `TestLargerClusterBalancerConditionals`, and tracked the locations of regions for 3 tables across 18 RegionServers:

1. 180 “product” table regions
2. 1 meta table region
3. 1 quotas table region

All regions began on a single RegionServer, and within 4 balancer iterations we had a well-balanced cluster and isolation of key system tables. It achieved this in about 2 minutes on my local machine, where most of that time was spent bootstrapping the mini cluster.

#### Table isolation performance testing

Likewise, we created large tests for system table isolation, meta table isolation, multi table isolation, and multi table isolation + replica distribution. These tests reliably find exactly what we're looking for, and do so expeditiously on my local machine for 100 servers and 10k+ regions; all tests reliably pass within a few minutes.

cc @ndimiduk @charlesconnell @ksravista @aalhour
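
To make the weighting problem from the summary concrete, here is a minimal Java sketch of why a weight of 100,000 drowns out a weight of 5. It is a toy model, not HBase code: the class and method names are hypothetical, and it assumes the balancer's total cost is a weighted sum of per-function costs that each fall in [0, 1], with a lower total being better. Only the two weights come from the description above.

```java
public class BalancerWeightSketch {
  // Default weights quoted in the PR description.
  public static final double PRIMARY_REPLICA_WEIGHT = 100_000;
  public static final double READ_REQUEST_WEIGHT = 5;

  // Assumption: total cost is a weighted sum of per-function costs,
  // where each individual cost has been scaled into [0, 1].
  public static double totalCost(double primaryReplicaCost, double readRequestCost) {
    return PRIMARY_REPLICA_WEIGHT * primaryReplicaCost
        + READ_REQUEST_WEIGHT * readRequestCost;
  }

  public static void main(String[] args) {
    // Layout A: replicas placed perfectly, read load as skewed as it can be.
    double layoutA = totalCost(0.0, 1.0);
    // Layout B: read load perfectly even, replica placement off by 0.01%.
    double layoutB = totalCost(0.0001, 0.0);

    System.out.printf("layout A total cost: %.4f%n", layoutA);
    System.out.printf("layout B total cost: %.4f%n", layoutB);
    // Under this model the balancer prefers whichever layout costs less.
    System.out.println("prefers A (skewed reads): " + (layoutA < layoutB));
  }
}
```

Under these assumptions, the worst possible read skew contributes at most 5 to the total, while a replica placement cost of just 0.0001 already contributes about 10, so the skewed layout wins. That is the sense in which the default weights leave actual load effectively unbalanced.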