[ https://issues.apache.org/jira/browse/HBASE-28192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Dimiduk updated HBASE-28192:
---------------------------------
Fix Version/s:     (was: 2.7.0)
                   (was: 3.0.0-beta-2)
                   (was: 2.6.1)
                   (was: 2.5.11)
Affects Version/s: 3.0.0-beta-1
                   2.6.0
Folding fix versions into affects versions.
> Master should recover if meta region state is inconsistent
> ----------------------------------------------------------
>
> Key: HBASE-28192
> URL: https://issues.apache.org/jira/browse/HBASE-28192
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 2.6.0, 2.4.17, 2.5.6, 3.0.0-beta-1
> Reporter: Viraj Jasani
> Assignee: Viraj Jasani
> Priority: Major
>
> During active master initialization, before we mark the master as active (i.e.
> {_}setInitialized(true){_}), both the meta and namespace regions must be online.
> If the region state of meta or namespace is inconsistent, the active master can
> get stuck in the initialization step:
> {code:java}
> private boolean isRegionOnline(RegionInfo ri) {
>   RetryCounter rc = null;
>   while (!isStopped()) {
>     ...
>     ...
>     ...
>     // Check once-a-minute.
>     if (rc == null) {
>       rc = new RetryCounterFactory(Integer.MAX_VALUE, 1000, 60_000).create();
>     }
>     Threads.sleep(rc.getBackoffTimeAndIncrementAttempts());
>   }
>   return false;
> }
> {code}
> In a recent outage, we observed that meta was online on a server, which was
> correctly reflected in the meta znode, but the server start time was
> different. This means that, per the latest transition record, meta was
> marked online on the old server (the same server with an old start time).
> This kept active master initialization waiting forever, and some SCPs
> (ServerCrashProcedures) got stuck in the initial stage, where they need to
> access the meta table before getting candidates for region moves.
> The only way out of this outage is for an operator to schedule recoveries
> for the old server using hbck, which triggers an SCP for the old server
> address of meta. Since many SCPs were stuck, processing the new SCP was also
> taking some time, and a manual restart of the active master triggered a
> failover. The new master was able to complete the SCP for the old meta
> server, correcting the meta assignment details, which eventually marked the
> master as active. Only after this were we able to see the large number of
> RITs (regions in transition) that had been hidden so far.
> We need to let the master recover from this state to avoid manual intervention.
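The mismatch described in the issue (meta recorded on the same host and port, but with a different server start time) can be illustrated with a small self-contained sketch. ServerId and MetaLocationCheck are hypothetical names introduced only for this illustration. This is not HBase code and not the proposed fix. It only assumes that a server is identified by host, port, and start time, so that a restarted process on the same address counts as a different server.

```java
import java.util.Objects;

// Hypothetical stand-in for a server identity: host, port, and process start time.
// It is NOT an HBase class; it only mirrors the idea that a restarted process
// on the same host and port is treated as a different server.
final class ServerId {
    final String host;
    final int port;
    final long startTime;

    ServerId(String host, int port, long startTime) {
        this.host = host;
        this.port = port;
        this.startTime = startTime;
    }

    // Same host and port, regardless of start time.
    boolean sameAddress(ServerId other) {
        return host.equals(other.host) && port == other.port;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ServerId)) {
            return false;
        }
        ServerId s = (ServerId) o;
        return sameAddress(s) && startTime == s.startTime;
    }

    @Override
    public int hashCode() {
        return Objects.hash(host, port, startTime);
    }

    @Override
    public String toString() {
        return host + "," + port + "," + startTime;
    }
}

final class MetaLocationCheck {
    // True when the recorded location has the same host and port as the live
    // location but a different start time, i.e. it refers to a stale instance.
    static boolean isStaleInstance(ServerId recorded, ServerId live) {
        return recorded.sameAddress(live) && recorded.startTime != live.startTime;
    }

    public static void main(String[] args) {
        // Example values are made up for illustration.
        ServerId recorded = new ServerId("rs1.example.com", 16020, 1_700_000_000_000L);
        ServerId live = new ServerId("rs1.example.com", 16020, 1_700_000_500_000L);
        System.out.println("recorded=" + recorded + " live=" + live
            + " stale=" + isStaleInstance(recorded, live));
    }
}
```

A check of this shape only detects the inconsistency. What the master should do once it is detected (for example, scheduling recovery for the stale instance) is the open question this issue raises.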
--
This message was sent by Atlassian Jira
(v8.20.10#820010)