[ https://issues.apache.org/jira/browse/HBASE-8119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13605847#comment-13605847 ]
Enis Soztutar commented on HBASE-8119:
--------------------------------------

Quoting review at https://reviews.apache.org/r/9998/:

Attaching a patch that improves the running time of StochasticLoadBalancer by roughly 200x.

TestStochasticLoadBalancer#testMidCluster()

Current impl:
//2013-03-15 17:28:25,495 DEBUG [main] balancer.StochasticLoadBalancer(256): Finished computing new laod balance plan. Computation took 172526ms to try 15000 different iterations. Found a solution that moves 600 regions; Going from a computed cost of 35.85000000000001 to a new cost of 23.481578947368426

With patch:
//2013-03-18 14:56:13,541 DEBUG [Thread-2] balancer.StochasticLoadBalancer(436): Finished computing new laod balance plan. Computation took 941ms to try 15000 different iterations. Found a solution that moves 600 regions; Going from a computed cost of 35.85 to a new cost of 23.48157894736842

The improvements come from:
 - Optimized array-based data structures in the Cluster class
 - Getting rid of hashmaps
 - Optimized region move and swap ops
 - Moving most of the computation to cluster initialization and to cluster state changes, so the same results are no longer computed over and over
 - Some profiling

There should be room for further optimizations, but this should be a good start. If we run into more problems, we can investigate further.

There are a lot of TODOs added in this patch. I'll create a jira for collecting some thoughts, but I won't have the time to work on those for now.

There are (hopefully) minor semantic changes in the algo. I had to bump up loadMultiplier and decrease moveCostMultiplier. See the comments at TestStochasticLoadBalancer#testLargeCluster(). Please review carefully.

As noted in testLargeCluster(), this does not work for large clusters (> 100000 regions, 1000 nodes). This can be solved by something like http://en.wikipedia.org/wiki/Simulated_annealing instead of a random walk with eager selection.
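For readers who have not seen the patch, the ideas listed in the comment (array-based state, O(1) region moves, cached values patched on each state change, and a random walk with eager selection) can be illustrated with a minimal Java sketch. This is not the code from the patch: the class name `ArrayClusterSketch`, its fields, and the variance-based `skewCost()` are all assumptions made for illustration, and the real StochasticLoadBalancer combines several weighted cost functions. The region counts in `main` are the ones from the issue description.

```java
import java.util.Random;

/** Hypothetical sketch only; this is NOT the HBase Cluster class from the patch. */
class ArrayClusterSketch {
    // regionToServer[r] is the index of the server currently hosting region r.
    final int[] regionToServer;
    // regionsPerServer[s] is the number of regions hosted on server s.
    final int[] regionsPerServer;
    // Cached sum of regionsPerServer[s]^2, patched on every move instead of recomputed.
    long sumOfSquares;

    ArrayClusterSketch(int numServers, int[] initialAssignment) {
        regionToServer = initialAssignment.clone();
        regionsPerServer = new int[numServers];
        for (int server : regionToServer) {
            regionsPerServer[server]++;
        }
        for (int count : regionsPerServer) {
            sumOfSquares += sq(count);
        }
    }

    private static long sq(int x) {
        return (long) x * x;
    }

    /** Builds a region-to-server array from per-server region counts. */
    static int[] assignmentFromCounts(int... counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        int[] assignment = new int[total];
        int region = 0;
        for (int server = 0; server < counts.length; server++) {
            for (int i = 0; i < counts[server]; i++) {
                assignment[region++] = server;
            }
        }
        return assignment;
    }

    /** Moves one region and updates the cached state in O(1). */
    void moveRegion(int region, int toServer) {
        int fromServer = regionToServer[region];
        if (fromServer == toServer) {
            return;
        }
        sumOfSquares -= sq(regionsPerServer[fromServer]) + sq(regionsPerServer[toServer]);
        regionsPerServer[fromServer]--;
        regionsPerServer[toServer]++;
        sumOfSquares += sq(regionsPerServer[fromServer]) + sq(regionsPerServer[toServer]);
        regionToServer[region] = toServer;
    }

    /** Variance of region counts across servers; an assumed stand-in cost function. */
    double skewCost() {
        double n = regionsPerServer.length;
        double mean = regionToServer.length / n;
        return sumOfSquares / n - mean * mean;
    }

    /** Random walk with eager selection: a random move is kept only if it lowers the cost. */
    void balance(int steps, long seed) {
        Random random = new Random(seed);
        for (int i = 0; i < steps; i++) {
            int region = random.nextInt(regionToServer.length);
            int target = random.nextInt(regionsPerServer.length);
            int source = regionToServer[region];
            double before = skewCost();
            moveRegion(region, target);
            if (skewCost() >= before) {
                moveRegion(region, source); // no improvement, undo the move
            }
        }
    }

    public static void main(String[] args) {
        // Region counts per server as reported in the issue description.
        int[] assignment = assignmentFromCounts(33, 34, 42, 282, 34);
        ArrayClusterSketch cluster = new ArrayClusterSketch(5, assignment);
        System.out.println("cost before: " + cluster.skewCost());
        cluster.balance(15000, 1L);
        System.out.println("cost after:  " + cluster.skewCost());
    }
}
```

Because each candidate move touches only two counters and one cached sum, the cost of trying a move does not depend on cluster size. A simulated-annealing variant, as suggested at the end of the comment, would differ only in the acceptance test inside `balance`: it would occasionally keep a move that raises the cost, with a probability that shrinks as the search proceeds.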
> Optimize StochasticLoadBalancer
> -------------------------------
>
>                 Key: HBASE-8119
>                 URL: https://issues.apache.org/jira/browse/HBASE-8119
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 0.95.0
>            Reporter: Enis Soztutar
>             Fix For: 0.95.0
>
>
> On a 5 node trunk cluster, I ran into a weird problem with
> StochasticLoadBalancer:
> server1 Thu Mar 14 03:42:50 UTC 2013 0.0 33
> server2 Thu Mar 14 03:47:53 UTC 2013 0.0 34
> server3 Thu Mar 14 03:46:53 UTC 2013 465.0 42
> server4 Thu Mar 14 03:47:53 UTC 2013 11455.0 282
> server5 Thu Mar 14 03:47:53 UTC 2013 0.0 34
> Total:5 11920 425
> Notice that server4 has 282 regions, while the others have far fewer. Plus,
> one table with 260 regions has been super imbalanced:
> {code}
> Regions by Region Server
> Region Server Region Count
> http://server3:60030/ 10
> http://server4:60030/ 250
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira