Viraj Jasani created HBASE-25460:
------------------------------------
Summary: Expose drainingServers as cluster metric
Key: HBASE-25460
URL: https://issues.apache.org/jira/browse/HBASE-25460
Project: HBase
Issue Type: New Feature
Reporter: Viraj Jasani
For some reason, a significantly high number of servers were put in
decommissioned mode and remained in that state for a long time, serving no
regions at all. This put heavy load on the rest of the live servers, and by the
time anyone recognized the improper balancing of the cluster, it was too late.
The cluster was imbalanced to the point where SLB (StochasticLoadBalancer)
would not balance it until
*_hbase.master.balancer.stochastic.runMaxSteps_* was turned on, because the
calculated number of steps was too high. As expected, such balancing brings an
immediate, sudden spike of RITs (regions in transition).
Although running into such a situation is rare, we can take some precautions by
exposing a metric. We should expose the list of draining RegionServers as a JMX
metric, just as we expose _*liveRegionServers*_ and _*deadRegionServers*_. Such
a metric can help configure alerts with a threshold on the percentage of total
RegionServers that are allowed to be in draining mode (e.g. during rolling
upgrades) under any circumstances.
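
As a rough illustration of the alerting idea, here is a minimal sketch of the threshold check such a metric would enable. The class and method names (DrainingAlert, drainingPercent, shouldAlert) and the 20% threshold are hypothetical and are not part of HBase. The sketch also assumes that the total RegionServer count used as the denominator already includes the draining servers.

```java
public final class DrainingAlert {
    private DrainingAlert() {}

    /**
     * Percentage of RegionServers that are draining.
     * Assumption: totalCount already includes the draining servers.
     */
    static double drainingPercent(int drainingCount, int totalCount) {
        if (drainingCount < 0 || totalCount < 0 || drainingCount > totalCount) {
            throw new IllegalArgumentException(
                "invalid counts: draining=" + drainingCount + ", total=" + totalCount);
        }
        if (totalCount == 0) {
            return 0.0; // empty cluster: nothing is draining
        }
        return 100.0 * drainingCount / totalCount;
    }

    /** True when the draining percentage is strictly above the threshold. */
    static boolean shouldAlert(int drainingCount, int totalCount, double thresholdPercent) {
        return drainingPercent(drainingCount, totalCount) > thresholdPercent;
    }

    public static void main(String[] args) {
        // Hypothetical values; in practice the counts would be read from the
        // proposed draining-servers metric and the existing live-servers metric.
        int draining = 3;
        int total = 10;
        double thresholdPercent = 20.0;
        System.out.println("draining % = " + drainingPercent(draining, total));
        System.out.println("alert = " + shouldAlert(draining, total, thresholdPercent));
    }
}
```

In a real deployment the comparison would more likely live in the monitoring system's alert rule than in application code; the sketch only shows the arithmetic the alert depends on.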
--
This message was sent by Atlassian Jira
(v8.3.4#803005)