[ 
https://issues.apache.org/jira/browse/HBASE-24292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301614#comment-17301614
 ] 

Rahul Kumar commented on HBASE-24292:
-------------------------------------

{code:java}
      // Check once-a-minute, with 30 retries
      if (rc == null) {
        // Bounded variant, instead of:
        //   rc = new RetryCounterFactory(Integer.MAX_VALUE, 1000, 60_000).create();
        rc = new RetryCounterFactory(30, 1000, 60_000).create();
      } else if (!rc.shouldRetry()) {
        // Retry budget exhausted; give up.
        return false;
      }
{code}
[~ndimiduk]  [~anoop.hbase]  How about changing the retry logic as shown above, 
so that we limit the number of retries? With this, in the best-case scenario it 
would wait for n(n+1)/2 seconds and in the worst case for n seconds, where n is 
the number of retries. Looking back at the branch-1 assignMeta logic for HMaster 
initialization, it seems we used to wait indefinitely there too, i.e. 
https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java#L2372
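
To make the retry budget concrete, here is a minimal standalone sketch of a bounded retry counter. It is not HBase's RetryCounter: the class name BoundedRetrySketch and its methods are hypothetical. It assumes that the sleep starts at the base interval, doubles after every attempt, and is capped at the maximum sleep time. The backoff policy that RetryCounterFactory actually applies should be checked against the HBase source.

```java
/**
 * Hypothetical, self-contained sketch of a bounded retry counter.
 * Assumption: exponential backoff, doubling per attempt, capped at maxSleepMillis.
 */
public class BoundedRetrySketch {
    private final int maxAttempts;
    private final long baseSleepMillis;
    private final long maxSleepMillis;
    private int attemptsUsed = 0;

    public BoundedRetrySketch(int maxAttempts, long baseSleepMillis, long maxSleepMillis) {
        this.maxAttempts = maxAttempts;
        this.baseSleepMillis = baseSleepMillis;
        this.maxSleepMillis = maxSleepMillis;
    }

    /** True while the retry budget has not been used up. */
    public boolean shouldRetry() {
        return attemptsUsed < maxAttempts;
    }

    /** Consumes one attempt and returns its backoff; does not actually sleep. */
    public long nextSleepMillis() {
        long sleep = baseSleepMillis;
        // Double once per attempt already used, stopping once the cap is reached.
        for (int i = 0; i < attemptsUsed && sleep < maxSleepMillis; i++) {
            sleep *= 2;
        }
        attemptsUsed++;
        return Math.min(sleep, maxSleepMillis);
    }

    /** Total time spent waiting if every attempt fails. */
    public static long totalWaitMillis(int maxAttempts, long baseMillis, long maxMillis) {
        BoundedRetrySketch rc = new BoundedRetrySketch(maxAttempts, baseMillis, maxMillis);
        long total = 0;
        while (rc.shouldRetry()) {
            total += rc.nextSleepMillis();
        }
        return total;
    }

    public static void main(String[] args) {
        // Same numbers as the proposal: 30 attempts, 1 s base, 60 s cap.
        System.out.println(totalWaitMillis(30, 1000, 60_000) + " ms");
    }
}
```

Under these assumed semantics, a budget of 30 attempts adds up to 1,503,000 ms, roughly 25 minutes of waiting before the caller gives up. A different backoff policy would give a different total.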

> A "stuck" master should not idle as active without taking action
> ----------------------------------------------------------------
>
>                 Key: HBASE-24292
>                 URL: https://issues.apache.org/jira/browse/HBASE-24292
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment
>    Affects Versions: 2.3.0
>            Reporter: Nick Dimiduk
>            Assignee: Rahul Kumar
>            Priority: Critical
>
> The master schedules a SCP for the region server hosting meta. However, due 
> to a misconfiguration, the cluster cannot make progress. After fixing the 
> configuration issue and restarting, the cluster still cannot make progress. 
> After the configured period (15 minutes), the master enters a "holding 
> pattern" where it retains Active master status, but isn't taking any action.
> This "brown-out" state is toxic. It should either keep trying to make 
> progress, or it should abort. Staying up and not doing anything is the wrong 
> thing to do.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
