rmdmattingly commented on code in PR #5534:
URL: https://github.com/apache/hbase/pull/5534#discussion_r1406613210


##########
hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ReopenTableRegionsProcedure.java:
##########
@@ -139,33 +170,57 @@ protected Flow executeFromState(MasterProcedureEnv env, ReopenTableRegionsState
       case REOPEN_TABLE_REGIONS_CONFIRM_REOPENED:
         regions = regions.stream().map(env.getAssignmentManager().getRegionStates()::checkReopened)
           .filter(l -> l != null).collect(Collectors.toList());
-        if (regions.isEmpty()) {
-          return Flow.NO_MORE_STATE;
+        // we need to create a set of region names because the HRegionLocation hashcode is only
+        // based
+        // on the server name
+        Set<byte[]> currentRegionBatchNames = currentRegionBatch.stream()
+          .map(r -> r.getRegion().getRegionName()).collect(Collectors.toSet());
+        currentRegionBatch = regions.stream()
+          .filter(r -> currentRegionBatchNames.contains(r.getRegion().getRegionName()))
+          .collect(Collectors.toList());
+        if (currentRegionBatch.isEmpty()) {
+          if (regions.isEmpty()) {
+            return Flow.NO_MORE_STATE;
+          } else {
+            setNextState(ReopenTableRegionsState.REOPEN_TABLE_REGIONS_REOPEN_REGIONS);
+            if (reopenBatchBackoffMillis > 0) {
+              backoff(reopenBatchBackoffMillis);
+            }
+            return Flow.HAS_MORE_STATE;
+          }
         }
-        if (regions.stream().anyMatch(loc -> canSchedule(env, loc))) {
+        if (currentRegionBatch.stream().anyMatch(loc -> canSchedule(env, loc))) {
           retryCounter = null;
           setNextState(ReopenTableRegionsState.REOPEN_TABLE_REGIONS_REOPEN_REGIONS);
+          if (reopenBatchBackoffMillis > 0) {
+            backoff(reopenBatchBackoffMillis);
+          }
           return Flow.HAS_MORE_STATE;
         }
         // We can not schedule TRSP for all the regions need to reopen, wait for a while and retry
         // again.
         if (retryCounter == null) {
           retryCounter = ProcedureUtil.createRetryCounter(env.getMasterConfiguration());
         }
-        long backoff = retryCounter.getBackoffTimeAndIncrementAttempts();
+        long backoffMillis = retryCounter.getBackoffTimeAndIncrementAttempts();
         LOG.info(
-          "There are still {} region(s) which need to be reopened for table {} are in "
+          "There are still {} region(s) which need to be reopened for table {}. {} are in "
             + "OPENING state, suspend {}secs and try again later",
-          regions.size(), tableName, backoff / 1000);
-        setTimeout(Math.toIntExact(backoff));
-        setState(ProcedureProtos.ProcedureState.WAITING_TIMEOUT);
-        skipPersistence();
+          regions.size(), tableName, currentRegionBatch.size(), backoffMillis / 1000);
+        backoff(backoffMillis);
         throw new ProcedureSuspendedException();
       default:
         throw new UnsupportedOperationException("unhandled state=" + state);
     }
   }
 
+  private void backoff(long millis) throws ProcedureSuspendedException {
+    setTimeout(Math.toIntExact(millis));
+    setState(ProcedureProtos.ProcedureState.WAITING_TIMEOUT);
+    skipPersistence();

Review Comment:
   In what event do we rely on the persisted state? Do we rely on it when moving between states, or is it only relevant when recovering from an HMaster roll or something? I agree with @bbeaudreault that we don't reevaluate the `regions` in `REOPEN_TABLE_REGIONS_REOPEN_REGIONS`, so if we rely on the persisted state to decide which regions to transition after a batch backoff, then I think this would stick us in an infinite loop, e.g.:
   1. `this.regions = [a, b, c, d, e]`, `batchSize=1`, `batchBackoffMs=1`
   2. enter `..._REOPEN_REGIONS` with `this.regions = [a, b, c, d, e]`, derive a current batch of `a`, and kick off a TRSP
   3. enter `..._CONFIRM_REOPENED`, evaluate `this.regions = [b, c, d, e]` because `a` has reopened, acknowledge the success by setting the current state to `WAITING_TIMEOUT` and the next state back to `..._REOPEN_REGIONS`, and throw `ProcedureSuspendedException` without persisting the new `this.regions` value
   4. `..._REOPEN_REGIONS` would then be loaded from the persisted state with `this.regions = [a, b, c, d, e]`, so we'd be back at step 2 and destined to repeat this cycle
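   To make the failure mode concrete, here is a small, self-contained sketch (toy code with hypothetical names, not HBase internals) contrasting what steps 1-4 would look like if the region list really were reloaded from the last persisted snapshot on every resume, versus the in-memory list surviving suspension:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of steps 1-4 above. "persisted" stands in for the last snapshot
// written to the proc store; each loop iteration is one suspend/resume cycle.
public class ReopenLoopSketch {

  // If skipPersistence() means the shrunken region list is never written back,
  // every resume reloads [a, b, c, d, e] and the procedure never makes progress.
  static int cyclesWhenReloadingPersistedState(int maxCycles) {
    List<String> persisted = new ArrayList<>(Arrays.asList("a", "b", "c", "d", "e"));
    int cycles = 0;
    while (!persisted.isEmpty() && cycles < maxCycles) {
      List<String> regions = new ArrayList<>(persisted); // step 4: reload from snapshot
      regions.remove(0); // steps 2-3: batchSize=1, batch "a" reopens
      // skipPersistence(): the new, shorter list is dropped on suspend.
      cycles++;
    }
    return cycles; // always hits maxCycles: the loop never drains
  }

  // If the in-memory list survives WAITING_TIMEOUT suspensions instead,
  // each cycle removes one region and the procedure drains in five batches.
  static int cyclesWhenKeepingInMemoryState() {
    List<String> regions = new ArrayList<>(Arrays.asList("a", "b", "c", "d", "e"));
    int cycles = 0;
    while (!regions.isEmpty()) {
      regions.remove(0);
      cycles++;
    }
    return cycles;
  }
}
```

   Since the batch-backoff test does pass, the second behavior is presumably what actually happens, which is the point of the question above.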
   
   To avoid this we could just persist our state whenever the current batch has no regions still in transition. All that said, I feel like the above example misunderstands how proc state is used, because otherwise I don't see how [this test](https://github.com/apache/hbase/blob/a3a880439fa553f0b96973bb051884c4a4998ea7/hbase-server/src/test/java/org/apache/hadoop/hbase/master/procedure/TestReopenTableRegionsProcedureBatchBackoff.java#L74-L88) would pass. It succeeds, which pretty clearly indicates the lack of an infinite loop.
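   A minimal sketch of that "persist when the batch drains" idea (toy code with hypothetical names; in the real procedure this would amount to not calling `skipPersistence()` on that path):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model: skip persistence only while the current batch is still in
// flight, and write the shrunken region list back whenever a batch drains.
public class PersistOnBatchDrain {

  private List<String> persisted = new ArrayList<>(Arrays.asList("a", "b", "c", "d", "e"));

  // Each loop iteration simulates one suspend/resume cycle that reloads state
  // from the persisted snapshot, so progress depends on persisting it.
  int cyclesToDrain(int batchSize, int maxCycles) {
    int cycles = 0;
    while (!persisted.isEmpty() && cycles < maxCycles) {
      List<String> regions = new ArrayList<>(persisted); // reload on resume
      int n = Math.min(batchSize, regions.size());
      regions.subList(0, n).clear(); // the whole batch reopened
      persisted = regions; // batch drained: persist instead of skipPersistence()
      cycles++;
    }
    return cycles;
  }
}
```

   With the snapshot updated at every batch boundary, the procedure drains even if it is reloaded from the store after each backoff.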
   
   re: 
   > I do not think we should keep increase the retry number and increase the retry interval while scheduling TRSPs
   
   We will only increment the retry count and increase the retry interval if our current batch has yet to successfully reopen; the interval remains fixed and the retry count does not grow while we're successfully moving on to the next batch of regions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to