[ 
https://issues.apache.org/jira/browse/HBASE-25460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HBASE-25460:
---------------------------------
    Description: 
For some reason, a significantly high number of servers were put in 
decommissioned mode, and they remained in that state for a significant time, 
serving no regions at all. This put heavy load on the rest of the live servers, 
and by the time anyone recognized the improper balancing of the cluster, it 
was too late. As expected, balancing such a cluster with or without 
*runMaxSteps* can bring a sudden spike of RITs in proportion to the degree of 
region imbalance in the cluster.

Although running into such a situation is rare, we can take some precautions by 
exposing a metric. We should expose the list of draining RegionServers as a JMX 
metric, just like we expose _*liveRegionServers*_ and _*deadRegionServers*_. 
Such a metric can help configure alerts with a threshold on the % of total RS 
that are allowed to be in draining mode (e.g. during rolling upgrades) under 
any circumstances.
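
As a rough illustration of the alerting idea above, here is a minimal sketch of 
the threshold check. The class and method names are hypothetical and not part 
of HBase. It assumes the monitoring side already has the total RegionServer 
count (including draining ones) and the draining count, e.g. derived from the 
proposed metric, and that the threshold is a percentage chosen by the operator.

```java
/** Hypothetical helper for illustration only; not an HBase class. */
final class DrainingAlertSketch {

    private DrainingAlertSketch() {}

    /**
     * Percentage (0..100) of RegionServers that are draining.
     * Assumption: totalServers counts every online RS, draining ones included.
     */
    static double drainingPercent(int totalServers, int drainingServers) {
        if (totalServers <= 0) {
            return 0.0; // nothing to compare against; treat as no draining load
        }
        return 100.0 * drainingServers / totalServers;
    }

    /** True when the draining share exceeds the operator-chosen threshold. */
    static boolean shouldAlert(int totalServers, int drainingServers,
                               double thresholdPercent) {
        return drainingPercent(totalServers, drainingServers) > thresholdPercent;
    }

    public static void main(String[] args) {
        // Example numbers are made up: 100 servers, 20 draining, 10% threshold.
        System.out.println(shouldAlert(100, 20, 10.0));
    }
}
```

A suitable threshold would depend on how many servers a rolling upgrade is 
expected to drain at once.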

  was:
For some reason, a significantly high number of servers were put in 
decommissioned mode, and they remained in that state for a significant time, 
serving no regions at all. This put heavy load on the rest of the live servers, 
and by the time anyone recognized the improper balancing of the cluster, it 
was too late. The cluster was imbalanced to the point where SLB would not 
balance it until one turned on 
*_hbase.master.balancer.stochastic.runMaxSteps_*, because the calculated steps 
were too high. And as expected, such balancing brings a sudden spike of RITs 
immediately.

Although running into such a situation is rare, we can take some precautions by 
exposing a metric. We should expose the list of draining RegionServers as a JMX 
metric, just like we expose _*liveRegionServers*_ and _*deadRegionServers*_. 
Such a metric can help configure alerts with a threshold on the % of total RS 
that are allowed to be in draining mode (e.g. during rolling upgrades) under 
any circumstances.


> Expose drainingServers as cluster metric
> ----------------------------------------
>
>                 Key: HBASE-25460
>                 URL: https://issues.apache.org/jira/browse/HBASE-25460
>             Project: HBase
>          Issue Type: New Feature
>    Affects Versions: 1.6.0
>            Reporter: Viraj Jasani
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0
>
>
> For some reason, a significantly high number of servers were put in 
> decommissioned mode, and they remained in that state for a significant time, 
> serving no regions at all. This put heavy load on the rest of the live 
> servers, and by the time anyone recognized the improper balancing of the 
> cluster, it was too late. As expected, balancing such a cluster with or 
> without *runMaxSteps* can bring a sudden spike of RITs in proportion to the 
> degree of region imbalance in the cluster.
> Although running into such a situation is rare, we can take some precautions 
> by exposing a metric. We should expose the list of draining RegionServers as 
> a JMX metric, just like we expose _*liveRegionServers*_ and 
> _*deadRegionServers*_. Such a metric can help configure alerts with a 
> threshold on the % of total RS that are allowed to be in draining mode (e.g. 
> during rolling upgrades) under any circumstances.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
