[ https://issues.apache.org/jira/browse/HBASE-24292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301614#comment-17301614 ]
Rahul Kumar commented on HBASE-24292:
-------------------------------------

{code:java}
// Check once-a-minute, with 30 retries
if (rc == null) {
  rc = new RetryCounterFactory(30, 1000, 60_000).create();
  // instead of
  // rc = new RetryCounterFactory(Integer.MAX_VALUE, 1000, 60_000).create();
} else if (!rc.shouldRetry()) {
  return false;
}
{code}

[~ndimiduk] [~anoop.hbase] How about changing the retry logic to the above, where we limit the number of retries? With this, in the best-case scenario it would wait for n(n+1)/2 secs and in the worst case for n secs, where n = number of retries.

Looking back at the branch-1 assignMeta logic for HMaster initialization, it seems we used to wait indefinitely there too, i.e. https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java#L2372

> A "stuck" master should not idle as active without taking action
> ----------------------------------------------------------------
>
>                 Key: HBASE-24292
>                 URL: https://issues.apache.org/jira/browse/HBASE-24292
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment
>    Affects Versions: 2.3.0
>            Reporter: Nick Dimiduk
>            Assignee: Rahul Kumar
>            Priority: Critical
>
> The master schedules an SCP for the region server hosting meta. However, due
> to a misconfiguration, the cluster cannot make progress. After fixing the
> configuration issue and restarting, the cluster still cannot make progress.
> After the configured period (15 minutes), the master enters a "holding
> pattern" where it retains Active master status, but isn't taking any action.
> This "brown-out" state is toxic. It should either keep trying to make
> progress, or it should abort. Staying up and not doing anything is the wrong
> thing to do.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
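
To put rough numbers on the wait bounds mentioned in the comment, here is a minimal standalone sketch. It does not use HBase classes: `BoundedRetrySketch`, `totalWaitMillis` and `doubling` are hypothetical names introduced only for illustration. A constant 1 s sleep over n retries sums to n secs, and a sleep that grows by 1 s per retry sums to n(n+1)/2 secs. The third case, a doubling sleep capped at 60 s, is an assumption about how the backoff might behave and is not confirmed against the actual `RetryCounterFactory` policy.

```java
import java.util.function.IntToLongFunction;

class BoundedRetrySketch {
    /** Sums the sleep taken before each of maxAttempts retries, capping every sleep at maxMillis. */
    static long totalWaitMillis(int maxAttempts, long maxMillis, IntToLongFunction sleepForAttempt) {
        long total = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            total += Math.min(sleepForAttempt.applyAsLong(attempt), maxMillis);
        }
        return total;
    }

    /** Hypothetical doubling policy: base, 2*base, 4*base, ... (growth stops once maxMillis is reached). */
    static long doubling(int attempt, long baseMillis, long maxMillis) {
        long sleep = baseMillis;
        for (int i = 1; i < attempt && sleep < maxMillis; i++) {
            sleep *= 2;
        }
        return sleep;
    }

    public static void main(String[] args) {
        int n = 30;          // retry limit proposed in the comment
        long base = 1_000;   // initial sleep in ms
        long max = 60_000;   // per-sleep cap in ms

        long constant = totalWaitMillis(n, max, a -> base);                  // n secs
        long linear = totalWaitMillis(n, max, a -> a * base);                // n(n+1)/2 secs
        long capped = totalWaitMillis(n, max, a -> doubling(a, base, max));  // assumed policy

        System.out.println("constant sleep:  " + constant / 1000 + " s");
        System.out.println("linear growth:   " + linear / 1000 + " s");
        System.out.println("capped doubling: " + capped / 1000 + " s");
    }
}
```

With n = 30 all three totals are finite, unlike the `Integer.MAX_VALUE` variant, where the master would keep retrying effectively forever.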