sguggilam commented on a change in pull request #1755:
URL: https://github.com/apache/hbase/pull/1755#discussion_r430780670
##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -1997,80 +2007,76 @@ private void unassign(final HRegionInfo region,
}
try {
// Send CLOSE RPC
- if (serverManager.sendRegionClose(server, region,
- versionOfClosingNode, dest, transitionInZK)) {
- LOG.debug("Sent CLOSE to " + server + " for region " +
- region.getRegionNameAsString());
+ if (serverManager.sendRegionClose(server, region, versionOfClosingNode, dest,
+ transitionInZK)) {
+ LOG.debug("Sent CLOSE to " + server + " for region " + region.getRegionNameAsString());
if (useZKForAssignment && !transitionInZK && state != null) {
// Retry to make sure the region is
// closed so as to avoid double assignment.
- unassign(region, state, versionOfClosingNode,
- dest, transitionInZK, src);
+ unassign(region, state, versionOfClosingNode, dest, transitionInZK, src);
}
return;
}
// This never happens. Currently regionserver close always return true.
// Todo; this can now happen (0.96) if there is an exception in a coprocessor
- LOG.warn("Server " + server + " region CLOSE RPC returned false for " +
- region.getRegionNameAsString());
+ LOG.warn("Server " + server + " region CLOSE RPC returned false for "
+ + region.getRegionNameAsString());
} catch (Throwable t) {
long sleepTime = 0;
Configuration conf = this.server.getConfiguration();
if (t instanceof RemoteException) {
- t = ((RemoteException)t).unwrapRemoteException();
+ t = ((RemoteException) t).unwrapRemoteException();
}
boolean logRetries = true;
- if (t instanceof RegionServerAbortedException
- || t instanceof RegionServerStoppedException
+ if (t instanceof RegionServerAbortedException || t instanceof RegionServerStoppedException
|| t instanceof ServerNotRunningYetException) {
// RS is aborting or stopping, we cannot offline the region since the region may need
- // to do WAL recovery. Until we see the RS expiration, we should retry.
+ // to do WAL recovery. Until we see the RS expiration, we should retry.
sleepTime = 1L + conf.getInt(RpcClient.FAILED_SERVER_EXPIRY_KEY,
RpcClient.FAILED_SERVER_EXPIRY_DEFAULT);
} else if (t instanceof NotServingRegionException) {
- LOG.debug("Offline " + region.getRegionNameAsString()
- + ", it's not any more on " + server, t);
+ LOG.debug(
+ "Offline " + region.getRegionNameAsString() + ", it's not any more
on " + server, t);
if (transitionInZK) {
deleteClosingOrClosedNode(region, server);
}
if (state != null) {
regionOffline(region);
}
return;
- } else if ((t instanceof FailedServerException) || (state != null &&
- t instanceof RegionAlreadyInTransitionException)) {
- if (t instanceof FailedServerException) {
- sleepTime = 1L + conf.getInt(RpcClient.FAILED_SERVER_EXPIRY_KEY,
+ } else if ((t instanceof FailedServerException)
Review comment:
Yes, there is no change in this section.
##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -2079,16 +2085,29 @@ private void unassign(final HRegionInfo region,
}
if (logRetries) {
- LOG.info("Server " + server + " returned " + t + " for "
- + region.getRegionNameAsString() + ", try=" + i
- + " of " + this.maximumAttempts, t);
+ LOG.info("Server " + server + " returned " + t + " for " +
region.getRegionNameAsString()
+ + ", try=" + i + " of " + this.maximumAttempts,
+ t);
// Presume retry or server will expire.
}
}
}
- // Run out of attempts
- if (state != null) {
- regionStates.updateRegionState(region, State.FAILED_CLOSE);
+
+ long sleepTime = backoffPolicy.getBackoffTime(retryConfig,
Review comment:
The idea is to use the exponential backoff configs such as
"hbase.assignment.retry.sleep.initial" for the backoff between retries, because
the retry attempts can be exhausted pretty fast when the server is loaded/busy
and cannot even acknowledge the region close request from the master. We need
to use them to schedule the retry at a later point, asynchronously in a
different thread.
The existing sleepTime is not really meant for this use case and does not read
any exponential backoff configs.
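To illustrate the intent only: a rough sketch of an exponential backoff derived
from an initial sleep and a cap. The class, constructor values, and the capping
behavior here are assumptions for the example, not the actual HBase
BackoffPolicy/RetryConfig code used by the patch.

public final class BackoffSketch {
  private final long initialSleepMs;
  private final long maxSleepMs;

  public BackoffSketch(long initialSleepMs, long maxSleepMs) {
    this.initialSleepMs = initialSleepMs;
    this.maxSleepMs = maxSleepMs;
  }

  /** Returns initialSleepMs * 2^attempts, capped at maxSleepMs. */
  public long getBackoffTime(int attempts) {
    long backoff = initialSleepMs * (1L << Math.min(attempts, 30));
    return Math.min(backoff, maxSleepMs);
  }

  public static void main(String[] args) {
    // Assumed values for illustration; the real values come from HBase configuration keys.
    BackoffSketch policy = new BackoffSketch(100L, 60_000L);
    for (int attempt = 0; attempt < 5; attempt++) {
      System.out.println("attempt " + attempt + " -> back off " + policy.getBackoffTime(attempt) + " ms");
    }
  }
}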
##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -1972,6 +1975,13 @@ private void unassign(final HRegionInfo region,
final RegionState state, final int versionOfClosingNode,
final ServerName dest, final boolean transitionInZK,
final ServerName src) {
+ String encodedName = region.getEncodedName();
+ AtomicInteger failedCloseCount = failedCloseTracker.get(encodedName);
+ if (failedCloseCount == null) {
+ failedCloseCount = new AtomicInteger();
+ failedCloseTracker.put(encodedName, failedCloseCount);
Review comment:
I agree, that's what we do for failedOpenTracker as well. I can make the change
so that forceRegionStateToOffline() takes the lock before changing the state.
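For illustration, one race-free way to initialize the per-region counter is
ConcurrentHashMap.computeIfAbsent. This is only a sketch of the idea, not the
patch itself; the class and method names below are made up for the example.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public final class FailedCloseTrackerSketch {
  // Per-region failed-close counters, keyed by encoded region name.
  private final ConcurrentMap<String, AtomicInteger> failedCloseTracker = new ConcurrentHashMap<>();

  /** Creates the counter atomically on first use and bumps it, avoiding the get/put race. */
  public int recordFailedClose(String encodedRegionName) {
    return failedCloseTracker
        .computeIfAbsent(encodedRegionName, k -> new AtomicInteger())
        .incrementAndGet();
  }

  /** Drops the counter once the region finally closes. */
  public void clear(String encodedRegionName) {
    failedCloseTracker.remove(encodedRegionName);
  }
}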
##########
File path: hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java
##########
@@ -2079,16 +2085,29 @@ private void unassign(final HRegionInfo region,
}
if (logRetries) {
- LOG.info("Server " + server + " returned " + t + " for "
- + region.getRegionNameAsString() + ", try=" + i
- + " of " + this.maximumAttempts, t);
+ LOG.info("Server " + server + " returned " + t + " for " +
region.getRegionNameAsString()
+ + ", try=" + i + " of " + this.maximumAttempts,
+ t);
// Presume retry or server will expire.
}
}
}
- // Run out of attempts
- if (state != null) {
- regionStates.updateRegionState(region, State.FAILED_CLOSE);
+
+ long sleepTime = backoffPolicy.getBackoffTime(retryConfig,
+ getFailedAttempts(encodedName, failedCloseTracker));
+ if (failedCloseCount.incrementAndGet() <= maximumAttempts && sleepTime > 0) {
+ if (failedCloseTracker.containsKey(encodedName)) {
+ // Sleep before trying unassign if this region has failed to close before
+ scheduledThreadPoolExecutor.schedule(new DelayedUnAssignCallable(this, region, state,
Review comment:
As mentioned, this mainly deals with the case where the RS is too busy to even
acknowledge the request from the Master, and it backs off before the next retry
so as to avoid the FAILED_CLOSE state if possible. In the case where we get
NotServingRegionException, we still offline the region.
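A minimal sketch of the schedule-and-retry idea described above. The executor
wiring, class name, and task signature here are illustrative assumptions, not
the actual DelayedUnAssignCallable implementation in the patch.

import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public final class DelayedRetrySketch {
  private final ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);

  /** Schedules another unassign attempt after the computed backoff instead of failing the close now. */
  public void scheduleRetry(Runnable unassignTask, long backoffMillis) {
    scheduler.schedule(unassignTask, backoffMillis, TimeUnit.MILLISECONDS);
  }

  public void shutdown() {
    scheduler.shutdown();
  }

  public static void main(String[] args) throws InterruptedException {
    DelayedRetrySketch retry = new DelayedRetrySketch();
    retry.scheduleRetry(() -> System.out.println("retrying region close after backoff"), 500L);
    Thread.sleep(1_000L);
    retry.shutdown();
  }
}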
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]