David Manning created HBASE-25726:
-------------------------------------
Summary: MoveCostFunction is not included in the list of cost
functions for StochasticLoadBalancer
Key: HBASE-25726
URL: https://issues.apache.org/jira/browse/HBASE-25726
Project: HBase
Issue Type: Bug
Components: Balancer
Affects Versions: 2.4.0, 2.3.1, 3.0.0-alpha-1, 1.7.0
Reporter: David Manning
After the OffPeakHours fix for MoveCostFunction (HBASE-24709), MoveCostFunction is
no longer included in the costFunctions list. {{addCostFunction}} expects the
multiplier to be non-zero, but the multiplier is now set only in the {{cost}} function.
As a result, {{hbase.master.balancer.stochastic.maxMovePercent}} is not
respected, and there is no cost function to oppose a move. Any move that
decreases total cost at all will be accepted, causing more churn and disruption
from balancer executions.
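The ordering problem described above can be illustrated with a small, self-contained model. This is a sketch, not the HBase source: the class names, the {{LazyMoveCostFunction}} type, and the {{registeredNames}} helper are hypothetical, and only the behaviour stated in this report is modelled (registration checks for a non-zero multiplier, while the move cost's multiplier is assigned inside {{cost}}).

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the registration bug; NOT the actual HBase code.
public class MoveCostRegistrationSketch {

  /** Hypothetical stand-in for a balancer cost function. */
  static class CostFunction {
    final String name;
    float multiplier; // weight; 0 means the function is treated as disabled

    CostFunction(String name, float multiplier) {
      this.name = name;
      this.multiplier = multiplier;
    }

    double cost() {
      return 0.0;
    }
  }

  /** Models a move cost whose weight is only assigned when cost() runs. */
  static class LazyMoveCostFunction extends CostFunction {
    static final float DEFAULT_MOVE_COST = 7f; // default weight quoted in this report

    LazyMoveCostFunction() {
      super("MoveCostFunction", 0f); // multiplier is still zero at construction
    }

    @Override
    double cost() {
      multiplier = DEFAULT_MOVE_COST; // too late: registration has already happened
      return 0.0;
    }
  }

  /** Mirrors the described check: only functions with a non-zero weight are kept. */
  static void addCostFunction(List<CostFunction> costFunctions, CostFunction fn) {
    if (fn.multiplier != 0f) {
      costFunctions.add(fn);
    }
  }

  /** Registers one ordinary function and the lazy move cost, then lists what survived. */
  static List<String> registeredNames() {
    List<CostFunction> costFunctions = new ArrayList<>();
    addCostFunction(costFunctions, new CostFunction("RegionCountSkewCostFunction", 500f));
    addCostFunction(costFunctions, new LazyMoveCostFunction());
    List<String> names = new ArrayList<>();
    for (CostFunction fn : costFunctions) {
      names.add(fn.name);
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(registeredNames()); // prints [RegionCountSkewCostFunction]
  }
}
```

In this model the move cost is silently dropped because its weight is still zero when the registration check runs. A fix along these lines would presumably need the weight to be known before registration, but how the actual patch does that is not stated here.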
We noticed this when investigating a case where the balancer ran after a
regionserver was restarted without using the region_mover script. The regionserver
came online with 0 regions, triggering the {{idleRegionServerExist}} shortcut in
{{needsBalance}}. The balancer ran to move regions to the newly
restarted regionserver, but it moved a large number of regions across the
cluster, hyper-optimizing the other cost variables. There were ~4300 regions in
the cluster at the time, so moving 25% of the regions should have produced a final
cost of at least 7 (the default MoveCostFunction weight). MoveCostFunction is also
not listed among the functions contributing to the initial cost:
2021-03-30 15:47:43,396 INFO [49187_ChoreService_3] balancer.StochasticLoadBalancer - start StochasticLoadBalancer.balancer, initCost=12.91377229840024, functionCost=RegionCountSkewCostFunction : (500.0, 0.014878672009326464); TableSkewCostFunction : (35.0, 0.013600280177445717); RegionReplicaHostCostFunction : (100000.0, 0.0); RegionReplicaRackCostFunction : (10000.0, 0.0); ReadRequestCostFunction : (5.0, 0.8296332203204705); WriteRequestCostFunction : (5.0, 0.06818455421617946); MemstoreSizeCostFunction : (5.0, 0.08132131691669181); StoreFileCostFunction : (5.0, 0.02054620605193966); computedMaxSteps: 1000000
2021-03-30 15:48:13,385 DEBUG [49187_ChoreService_3] balancer.StochasticLoadBalancer - Finished computing new load balance plan. Computation took 30004ms to try 6571 different iterations. Found a solution that moves 1095 regions; Going from a computed cost of 12.91377229840024 to a new cost of 4.804625730746651
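The numbers in the two log lines can be cross-checked with a few lines of arithmetic. The sketch below assumes that the total cost is the sum of weight times scaled cost over the listed functions, an assumption the logged values bear out. The class and method names are hypothetical, and the region count of 4300 is the approximate figure given in this report.

```java
// Arithmetic check on the logged balancer values; names here are illustrative only.
public class BalancerLogCheck {

  // (weight, scaled cost) pairs copied from the functionCost field of the INFO line.
  static final double[][] LOGGED = {
    {500.0, 0.014878672009326464},  // RegionCountSkewCostFunction
    {35.0, 0.013600280177445717},   // TableSkewCostFunction
    {100000.0, 0.0},                // RegionReplicaHostCostFunction
    {10000.0, 0.0},                 // RegionReplicaRackCostFunction
    {5.0, 0.8296332203204705},      // ReadRequestCostFunction
    {5.0, 0.06818455421617946},     // WriteRequestCostFunction
    {5.0, 0.08132131691669181},     // MemstoreSizeCostFunction
    {5.0, 0.02054620605193966},     // StoreFileCostFunction
  };

  static final double LOGGED_INIT_COST = 12.91377229840024;
  static final double LOGGED_FINAL_COST = 4.804625730746651;
  static final double DEFAULT_MOVE_WEIGHT = 7.0; // default weight quoted in this report

  /** Sum of weight * scaled cost over the functions listed in the log. */
  static double weightedSum() {
    double total = 0.0;
    for (double[] entry : LOGGED) {
      total += entry[0] * entry[1];
    }
    return total;
  }

  /** Fraction of regions moved by the plan. */
  static double fractionMoved(int movedRegions, int totalRegions) {
    return (double) movedRegions / totalRegions;
  }

  public static void main(String[] args) {
    System.out.println("weighted sum of listed functions = " + weightedSum());
    System.out.println("logged initCost                  = " + LOGGED_INIT_COST);
    System.out.println("fraction of regions moved        = " + fractionMoved(1095, 4300));
    System.out.println("final cost below move weight?    = "
        + (LOGGED_FINAL_COST < DEFAULT_MOVE_WEIGHT));
  }
}
```

The eight listed functions alone reproduce the logged initCost, so no move cost term is contributing. The plan moves slightly more than 25% of roughly 4300 regions, yet the final cost of about 4.80 is below the 7 that a fully weighted move cost would add on its own.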
--
This message was sent by Atlassian Jira
(v8.3.4#803005)