[ 
https://issues.apache.org/jira/browse/HBASE-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Dimiduk updated HBASE-28192:
---------------------------------
        Fix Version/s:     (was: 2.7.0)
                           (was: 3.0.0-beta-2)
                           (was: 2.6.1)
                           (was: 2.5.11)
    Affects Version/s: 3.0.0-beta-1
                       2.6.0

Folding fix versions into affects versions.

> Master should recover if meta region state is inconsistent
> ----------------------------------------------------------
>
>                 Key: HBASE-28192
>                 URL: https://issues.apache.org/jira/browse/HBASE-28192
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 2.6.0, 2.4.17, 2.5.6, 3.0.0-beta-1
>            Reporter: Viraj Jasani
>            Assignee: Viraj Jasani
>            Priority: Major
>
> During active master initialization, before we set master as active (i.e. 
> {_}setInitialized(true){_}), we need both meta and namespace regions online. 
> If the region state of meta or namespace is inconsistent, active master can 
> get stuck in the initialization step:
> {code:java}
> private boolean isRegionOnline(RegionInfo ri) {
>   RetryCounter rc = null;
>   while (!isStopped()) {
> ...
> ...
> ...
>     // Check once-a-minute.
>     if (rc == null) {
>       rc = new RetryCounterFactory(Integer.MAX_VALUE, 1000, 60_000).create();
>     }
>     Threads.sleep(rc.getBackoffTimeAndIncrementAttempts());
>   }
>   return false;
> }
>  {code}
> In a recent outage, we observed that meta was online on a server, and this 
> was correctly reflected in the meta znode, but the server start time was 
> different. In other words, per the latest transition record, meta was 
> marked online on the old server (same server with an old start time). This kept 
> active master initialization waiting forever, and some SCPs got stuck in 
> the initial stage, where they need to access the meta table before getting 
> candidates for region moves.
> The only way out of this outage is for the operator to schedule recoveries 
> using hbck for the old server, which triggers an SCP for the old server 
> address of meta. Since many SCPs were stuck, processing the new SCP was also 
> taking some time. A manual restart of the active master triggered a failover, 
> and the new master was able to complete the SCP for the old meta server, 
> correcting the meta assignment details. This eventually marked the master as 
> active, and only then were we able to see the large number of RITs that had 
> been hidden until then.
> We need to let the master recover from this state without manual intervention.
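
The inconsistency described above (same host and port, different start time) can be illustrated with a minimal, self-contained sketch. This is not the change made for this issue and uses no HBase classes: ServerId, Verdict and compare are hypothetical names introduced only for illustration, and the "host,port,startcode" string form is assumed to mirror how HBase prints a server name.

```java
import java.util.Objects;

public class MetaLocationCheck {

    /** Hypothetical, simplified server identity: host, port and start code. */
    static final class ServerId {
        final String host;
        final int port;
        final long startCode;

        ServerId(String host, int port, long startCode) {
            this.host = Objects.requireNonNull(host);
            this.port = port;
            this.startCode = startCode;
        }

        /** Parses the assumed "host,port,startcode" form. */
        static ServerId parse(String s) {
            String[] parts = s.split(",");
            if (parts.length != 3) {
                throw new IllegalArgumentException("expected host,port,startcode: " + s);
            }
            return new ServerId(parts[0], Integer.parseInt(parts[1]), Long.parseLong(parts[2]));
        }

        /** True when host and port match, regardless of start code. */
        boolean sameAddress(ServerId other) {
            return host.equals(other.host) && port == other.port;
        }
    }

    enum Verdict { CONSISTENT, STALE_START_CODE, DIFFERENT_SERVER }

    /**
     * Compares the server recorded for meta with the server that is actually live.
     * STALE_START_CODE is the case from the outage: same address, old start time.
     */
    static Verdict compare(ServerId recorded, ServerId live) {
        if (!recorded.sameAddress(live)) {
            return Verdict.DIFFERENT_SERVER;
        }
        return recorded.startCode == live.startCode
            ? Verdict.CONSISTENT
            : Verdict.STALE_START_CODE;
    }

    public static void main(String[] args) {
        // Example values are made up for illustration.
        ServerId recorded = ServerId.parse("rs1.example.com,16020,1700000000000");
        ServerId live = ServerId.parse("rs1.example.com,16020,1700000500000");
        System.out.println(compare(recorded, live));
    }
}
```

A master that detected the STALE_START_CODE case during initialization could, in principle, schedule the recovery for the old server instance itself instead of waiting for an operator. How that should be done safely is the subject of this issue and is not sketched here.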



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
