Frens Jan Rumph created HBASE-28820:
---------------------------------------

             Summary: TableSkew cost scales beyond 1
                 Key: HBASE-28820
                 URL: https://issues.apache.org/jira/browse/HBASE-28820
             Project: HBase
          Issue Type: Bug
          Components: Balancer
    Affects Versions: 2.5.7
         Environment: We experienced the issue with Apache HBase 2.5.7 on Apache Hadoop 3.3.6 using Java 17 on Debian 12 (Bookworm).
            Reporter: Frens Jan Rumph


This may already be covered by later releases, but we noticed that the table skew cost function can produce cost values beyond 1. In our case, with over 1000 tables, this caused the table skew cost to overwhelm the region count skew (and other) cost functions.

I think this is because the per-table costs are 'simply' summed in TableSkewCostFunction#cost. So if the number of tables with skew is large, this cost function may cause the balancer to favour actions that decrease this cost at too great an expense to other costs such as region count skew.

Logging from the HBase master that shows this:

{code:java}
[...] balancer.StochasticLoadBalancer: dBalancer.balancer, initial weighted average imbalance=0.25500371101846336, functionCost=RegionCountSkewCostFunction : (multiplier=100000.0, imbalance=0.24272066309658274, need balance); PrimaryRegionCountSkewCostFunction : (not needed); MoveCostFunction : (multiplier=7.0, imbalance=0.0); ServerLocalityCostFunction : (multiplier=25.0, imbalance=0.6022498608833904, need balance); RackLocalityCostFunction : (multiplier=15.0, imbalance=0.0); TableSkewCostFunction : (multiplier=35.0, imbalance=35.24784226006047, need balance); RegionReplicaHostCostFunction : (not needed); RegionReplicaRackCostFunction : (not needed); ReadRequestCostFunction : (multiplier=5.0, imbalance=0.24057323733439073, need balance); WriteRequestCostFunction : (multiplier=5.0, imbalance=0.3233739875438904, need balance); MemStoreSizeCostFunction : (multiplier=5.0, imbalance=0.3195880383071082, need balance); StoreFileCostFunction : (multiplier=5.0, imbalance=0.23335375436276784, need balance); computedMaxSteps=1000000
{code}

Note the {{TableSkewCostFunction : (multiplier=35.0, imbalance=35.24784226006047)}} part: its imbalance is far above 1, while every other cost function in this log line reports a value between 0 and 1.

To work around this, we temporarily reduced the multiplier of the table skew cost function to 0.

The test case below fails on the HBase 2.5 and 2.6 branches. It sets up two tables with two regions each and places all regions of each table on a single server (t1 on n1, t2 on n2).

{code:java}
@Test
public void testTableSkewCost() {
  TableName t1 = TableName.valueOf("t1");
  TableName t2 = TableName.valueOf("t2");

  // Both regions of t1 live on n1 and both regions of t2 live on n2,
  // so each table is fully skewed onto one server.
  TreeMap<ServerName, List<RegionInfo>> clusterState = new TreeMap<>();
  clusterState.put(ServerName.valueOf("n1", 16020, 0), Arrays.asList(
    RegionInfoBuilder.newBuilder(t1).setRegionId(11).build(),
    RegionInfoBuilder.newBuilder(t1).setRegionId(12).build()
  ));
  clusterState.put(ServerName.valueOf("n2", 16020, 0), Arrays.asList(
    RegionInfoBuilder.newBuilder(t2).setRegionId(21).build(),
    RegionInfoBuilder.newBuilder(t2).setRegionId(22).build()
  ));
  BalancerClusterState cluster = new BalancerClusterState(clusterState, null, null, null);

  Configuration conf = HBaseConfiguration.create();
  CostFunction costFunction = new TableSkewCostFunction(conf);
  costFunction.prepare(cluster);
  double cost = costFunction.cost();

  // A cost function is expected to return a value in [0, 1].
  assertTrue(cost >= 0);
  assertTrue(cost <= 1.01);
}
{code}

It's the second assertion that fails, since the computed cost for this cluster state is 2. I guess none of the existing cluster state test/mock configurations have a real table skew.
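
To make the suspected summation effect concrete, here is a minimal, self-contained sketch. It is a simplified model and not the HBase implementation: the class {{TableSkewSketch}} and its methods {{perTableCost}}, {{summedCost}} and {{averagedCost}} are hypothetical names, and the per-table scaling (distance of the busiest server from an even spread, relative to the worst case of all regions on one server) is an assumption made for illustration. Averaging is shown only as one possible way to keep the total within [0, 1]; an actual fix may normalize differently.

```java
// Simplified model of the reported behaviour. NOT the HBase implementation:
// all names and the scaling formula below are assumptions for illustration.
class TableSkewSketch {

  /**
   * Skew of a single table, scaled to [0, 1].
   * regionsPerServer[s] is the number of this table's regions on server s.
   */
  static double perTableCost(int[] regionsPerServer) {
    int total = 0;
    int max = 0;
    for (int n : regionsPerServer) {
      total += n;
      max = Math.max(max, n);
    }
    int servers = regionsPerServer.length;
    if (servers == 0 || total == 0) {
      return 0.0;
    }
    // Best case: regions spread evenly. Worst case: all regions on one server.
    int best = (total + servers - 1) / servers; // ceil(total / servers)
    int worst = total;
    if (worst == best) {
      return 0.0; // nothing can be balanced for this table
    }
    return (double) (max - best) / (worst - best);
  }

  /** Sums the per-table costs: grows with the number of skewed tables. */
  static double summedCost(int[][] tables) {
    double cost = 0.0;
    for (int[] table : tables) {
      cost += perTableCost(table);
    }
    return cost;
  }

  /** Averages the per-table costs: stays within [0, 1]. */
  static double averagedCost(int[][] tables) {
    return tables.length == 0 ? 0.0 : summedCost(tables) / tables.length;
  }

  public static void main(String[] args) {
    // Same layout as the test case: t1 entirely on n1, t2 entirely on n2.
    int[][] tables = { {2, 0}, {0, 2} };
    System.out.println("summed   = " + summedCost(tables));   // 2.0
    System.out.println("averaged = " + averagedCost(tables)); // 1.0
  }
}
```

Under this model the two-table layout from the test case gives a summed cost of 2, matching the value the failing assertion reports, and the summed cost would keep growing with every additional skewed table.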