Viraj Jasani created HBASE-28048:
------------------------------------

             Summary: RSProcedureDispatcher to abort executing request after configurable retries
                 Key: HBASE-28048
                 URL: https://issues.apache.org/jira/browse/HBASE-28048
             Project: HBase
          Issue Type: Improvement
    Affects Versions: 2.5.5, 2.4.17, 3.0.0-alpha-4
            Reporter: Viraj Jasani
             Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1


In a recent incident, we observed that RSProcedureDispatcher keeps executing region open/close procedures with unbounded retries, even in the presence of a known persistent failure such as a GSS initiate failure:

{code:java}
2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=0, retrying... {code}

If the remote execution results in an IOException, the dispatcher attempts to schedule the procedure for further retries:

{code:java}
    private boolean scheduleForRetry(IOException e) {
      LOG.debug("Request to {} failed, try={}", serverName, numberOfAttemptsSoFar, e);
      // Should we wait a little before retrying? If the server is starting it's yes.
      ...
      ...
      ...
      numberOfAttemptsSoFar++;
      // Add some backoff here as the attempts rise otherwise if a stuck condition, will fill logs
      // with failed attempts. None of our backoff classes -- RetryCounter or ClientBackoffPolicy
      // -- fit here nicely so just do something simple; increment by rsRpcRetryInterval millis *
      // retry^2 on each try
      // up to max of 10 seconds (don't want to back off too much in case of situation change).
      submitTask(this,
        Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * this.numberOfAttemptsSoFar),
          10 * 1000),
        TimeUnit.MILLISECONDS);
      return true;
    }
{code}

Even though backoff is applied between retries, the maximum wait time is capped at 10 seconds:

{code:java}
submitTask(this,
  Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * this.numberOfAttemptsSoFar),
    10 * 1000),
  TimeUnit.MILLISECONDS); {code}

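Plugging numbers into the formula above shows how quickly the delay saturates. The sketch below assumes rsRpcRetryInterval = 100 ms (believed to be the default for hbase.regionserver.rpc.retry.interval; the class and constant names here are illustrative, not HBase code):

```java
// Sketch: reproduce the dispatcher's backoff formula to show how fast it
// saturates at the 10 s cap. The 100 ms interval is an assumed default.
public class BackoffSketch {
  static final long RS_RPC_RETRY_INTERVAL_MS = 100L; // assumed default
  static final long MAX_BACKOFF_MS = 10 * 1000L;

  // Same shape as scheduleForRetry: interval * attempt^2, capped at 10 s.
  public static long backoffMillis(int attempt) {
    return Math.min(RS_RPC_RETRY_INTERVAL_MS * ((long) attempt * attempt), MAX_BACKOFF_MS);
  }

  public static void main(String[] args) {
    for (int attempt : new int[] { 1, 5, 10, 50, 266 }) {
      System.out.println("try=" + attempt + " backoff=" + backoffMillis(attempt) + "ms");
    }
  }
}
```

With these assumptions the cap is reached by the 10th attempt (100 ms * 10^2 = 10 s), so from then on the dispatcher retries every 10 seconds indefinitely, which matches the try=217/266 log lines below.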
This results in an endless loop of retries until either the underlying issue is fixed (e.g. the Kerberos issue in this case) or the regionserver is killed and the ongoing open/close region procedure (and perhaps the entire SCP) for the affected regionserver is sidelined manually.
{code:java}
2023-08-25 03:04:18,918 WARN  [ispatcher-pool-41274] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=217, retrying...
2023-08-25 03:04:18,916 WARN  [ispatcher-pool-41280] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=193, retrying...
2023-08-25 03:04:28,968 WARN  [ispatcher-pool-41315] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=266, retrying...
2023-08-25 03:04:28,969 WARN  [ispatcher-pool-41240] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=266, retrying...{code}
 

While external issues like Kerberos ticket expiry require operator intervention, it is not prudent to fill up the active handlers with endless retries while attempting to execute RPCs against only a single affected regionserver. This eventually degrades the overall cluster state, specifically in the event of multiple regionserver restarts resulting from planned activities.

One possible resolution would be:
 # Make the max retry count configurable and carry it as part of the ExecuteProceduresRequest (or it could be part of RemoteProcedureRequest).
 # Have RSProcedureDispatcher use this retry count when scheduling failed requests for further retries.
 # After exhausting the retries, mark the remote call as failed and bubble the failure up to the parent procedure.
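The bounded-retry check from the steps above could be sketched as follows. This is only an illustration: the maxRetries field and failRemoteCall() hook are hypothetical names, not existing HBase APIs, and the real change would live inside RSProcedureDispatcher's retry path:

```java
// Sketch of the proposed bounded-retry logic. Field and method names
// (maxRetries, failRemoteCall) are hypothetical, for illustration only.
public class BoundedRetrySketch {
  private final int maxRetries; // would come from ExecuteProceduresRequest
  private int numberOfAttemptsSoFar;

  public BoundedRetrySketch(int maxRetries) {
    this.maxRetries = maxRetries;
  }

  /**
   * Returns true if the request was rescheduled for another attempt, false
   * if retries are exhausted and the failure should bubble up to the
   * parent procedure.
   */
  public boolean scheduleForRetry(Exception e) {
    if (numberOfAttemptsSoFar >= maxRetries) {
      // Exhausted: surface the root cause instead of retrying forever.
      failRemoteCall(e);
      return false;
    }
    numberOfAttemptsSoFar++;
    // Same quadratic backoff as today, capped at 10 s (100 ms assumed interval).
    long backoffMs = Math.min(100L * numberOfAttemptsSoFar * numberOfAttemptsSoFar, 10_000L);
    // In the dispatcher this would be: submitTask(this, backoffMs, TimeUnit.MILLISECONDS).
    return true;
  }

  private void failRemoteCall(Exception e) {
    // In the real fix this would mark the remote procedure as failed so the
    // parent procedure (e.g. an SCP) can react, rather than looping forever.
  }
}
```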

If the above sequence of calls ends up aborting the active master, we should clearly log a FATAL/ERROR message with the underlying root cause (e.g. the GSS initiate failure in this case). That would help the operator either fix the Kerberos ticket expiry or abort the affected regionserver, leaving the SCP to perform the heavy task of WAL splitting and recovery. Without such conditional termination, however, nothing prevents other procedures, as well as active handlers, from getting stuck executing remote calls indefinitely.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
