[ https://issues.apache.org/jira/browse/HBASE-28048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760627#comment-17760627 ]

Andrew Kyle Purtell commented on HBASE-28048:
---------------------------------------------

bq. Let's assume we are moving all regions from server A to server B. If server 
A is not reachable, and we fail all TRSP for region moves from A to B, the only 
alternative that the operator or software would be left with is stopping server 
A non-gracefully so that new SCP for server A can be processed by master.

We have this same problem for any region state transition, for any TRSP. 
[~vjasani] [~zhangduo] 

In some cases in our production we are seeing retries continue for more than 
10 minutes against an unresponsive or dead regionserver. It's too much, too 
long. It cannot be required for an operator to step in every time to manually 
schedule an SCP for the unresponsive server. TRSP should abort itself, or the 
parent procedure of the TRSP should abort it, if the target server does not 
respond within a reasonable time bound. I am thinking 1 minute. The clock is 
ticking on the RIT while we are retrying RPCs to an unresponsive server. The 
time required to detect that the server is unresponsive should be fairly 
short, so the total RIT time remains fairly short. 

To start with, TRSP should not retry an effectively unlimited number of times. 
If the total retry time exceeds a minute or two, it should give up. Then, 
depending on the region state, either another server is chosen or the 
unresponsive server is fenced and killed with a forced SCP, which grabs the 
lease on the RS WAL in order to split it, killing the RS as desired.
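
As a sketch of the time-bounded check I have in mind in the dispatcher's retry path (illustrative only; firstAttemptTimeMs, retryDeadlineMs, and the failure path here are hypothetical names, not existing HBase code):

{code:java}
// Hypothetical: give up once a wall-clock budget is exhausted rather than
// retrying indefinitely while the RIT clock ticks.
private boolean scheduleForRetry(IOException e) {
  long elapsedMs = EnvironmentEdgeManager.currentTime() - firstAttemptTimeMs;
  if (elapsedMs > retryDeadlineMs) { // e.g. 60 * 1000 for a one minute budget
    LOG.error("Request to {} still failing after {} ms, giving up", serverName, elapsedMs, e);
    // Fail the remote call so TRSP (or its parent) can choose another server,
    // or fence the unresponsive server with a forced SCP.
    remoteCallFailed(procedureEnv, e);
    return false;
  }
  // ... fall through to the existing backoff-and-resubmit logic ...
  return true;
}
{code}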

We should maybe also consider adding active probes, liveness checks, and a 
predictive component (like Φ-accrual failure detection), so the master can 
identify sick or unresponsive regionservers before they impact production and 
fence and kill them proactively with SCP. 
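
For reference, a minimal sketch of the Φ-accrual idea (illustrative only, not HBase code; it assumes heartbeat inter-arrival times are roughly normally distributed): suspicion grows continuously with the time since the last heartbeat, measured against the history of intervals, so a threshold on φ can flag a sick server well before a fixed timeout fires.

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal Φ-accrual failure detector sketch; phi(now) rises as the silence
// since the last heartbeat becomes unlikely given past heartbeat intervals.
public class PhiAccrualDetector {
  private final Deque<Long> intervalsMs = new ArrayDeque<>();
  private final int window = 100; // number of recent intervals to remember
  private long lastHeartbeatMs = -1;

  public synchronized void heartbeat(long nowMs) {
    if (lastHeartbeatMs >= 0) {
      if (intervalsMs.size() == window) {
        intervalsMs.removeFirst();
      }
      intervalsMs.addLast(nowMs - lastHeartbeatMs);
    }
    lastHeartbeatMs = nowMs;
  }

  /** Suspicion level; implementations commonly act around phi >= 8. */
  public synchronized double phi(long nowMs) {
    if (intervalsMs.isEmpty()) {
      return 0.0;
    }
    double mean = intervalsMs.stream().mapToLong(Long::longValue).average().orElse(1.0);
    double variance =
      intervalsMs.stream().mapToDouble(i -> (i - mean) * (i - mean)).average().orElse(1.0);
    double std = Math.max(Math.sqrt(variance), 1.0);
    double sinceMs = nowMs - lastHeartbeatMs;
    // P(interval > sinceMs) under a normal approximation (complementary CDF).
    double pLater = 0.5 * erfc((sinceMs - mean) / (std * Math.sqrt(2.0)));
    return -Math.log10(Math.max(pLater, 1e-15));
  }

  // Abramowitz-Stegun approximation of the complementary error function.
  private static double erfc(double x) {
    double t = 1.0 / (1.0 + 0.3275911 * Math.abs(x));
    double y = t * (0.254829592 + t * (-0.284496736
      + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429)))) * Math.exp(-x * x);
    return x >= 0 ? y : 2.0 - y;
  }
}
{code}

The master would feed regionserver heartbeats into heartbeat() and could schedule a forced SCP once phi() stays above the chosen threshold.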

> RSProcedureDispatcher to abort executing request after configurable retries
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-28048
>                 URL: https://issues.apache.org/jira/browse/HBASE-28048
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 3.0.0-alpha-4, 2.4.17, 2.5.5
>            Reporter: Viraj Jasani
>            Priority: Major
>             Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> In a recent incident, we observed that RSProcedureDispatcher continues 
> executing region open/close procedures with unbounded retries even in the 
> presence of known failures like GSS initiate failure:
>  
> {code:java}
> 2023-08-25 02:21:02,821 WARN [ispatcher-pool-40777] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=0, retrying... {code}
> 
> If the remote execution results in an IOException, the dispatcher attempts to schedule the procedure for further retries:
>  
> {code:java}
>     private boolean scheduleForRetry(IOException e) {
>       LOG.debug("Request to {} failed, try={}", serverName, numberOfAttemptsSoFar, e);
>       // Should we wait a little before retrying? If the server is starting it's yes.
>       ...
>       ...
>       ...
>       numberOfAttemptsSoFar++;
>       // Add some backoff here as the attempts rise otherwise if a stuck condition, will fill logs
>       // with failed attempts. None of our backoff classes -- RetryCounter or ClientBackoffPolicy
>       // -- fit here nicely so just do something simple; increment by rsRpcRetryInterval millis *
>       // retry^2 on each try
>       // up to max of 10 seconds (don't want to back off too much in case of situation change).
>       submitTask(this,
>         Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * this.numberOfAttemptsSoFar),
>           10 * 1000),
>         TimeUnit.MILLISECONDS);
>       return true;
>     }
> {code}
> 
> Even though we try to provide backoff while retrying, the max wait time is capped at 10s:
>  
> {code:java}
> submitTask(this,
>   Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * this.numberOfAttemptsSoFar),
>     10 * 1000),
>   TimeUnit.MILLISECONDS); {code}
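> 
> With quadratic backoff capped at 10 s, the retry rate quickly becomes constant. A rough sketch of the arithmetic, assuming the default rsRpcRetryInterval of 100 ms (hbase.regionserver.rpc.retry.interval):
> {code:java}
> // Illustrative only: the effective backoff curve with the 10 s cap.
> static long backoffMs(int attempt) {
>   return Math.min(100L * attempt * attempt, 10_000L); // 100 ms = assumed default interval
> }
> // attempt  1 ->    100 ms
> // attempt  5 ->  2,500 ms
> // attempt 10 -> 10,000 ms  (cap reached)
> // every later attempt waits the full 10 s: roughly 6 retries per minute, unbounded
> {code}
> This is consistent with the logs in this report: going from try=0 at 02:21 to try=217 at 03:04 works out to roughly 12 seconds per attempt (the 10 s backoff plus RPC time).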
> 
> This results in an endless loop of retries until either the underlying issue is fixed (e.g. the krb issue in this case) or the regionserver is killed and the ongoing open/close region procedure (and perhaps the entire SCP) for the affected regionserver is sidelined manually.
> {code:java}
> 2023-08-25 03:04:18,918 WARN  [ispatcher-pool-41274] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=217, retrying...
> 2023-08-25 03:04:18,916 WARN  [ispatcher-pool-41280] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=193, retrying...
> 2023-08-25 03:04:28,968 WARN  [ispatcher-pool-41315] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=266, retrying...
> 2023-08-25 03:04:28,969 WARN  [ispatcher-pool-41240] procedure.RSProcedureDispatcher - request to rs1,61020,1692930044498 failed due to java.io.IOException: Call to address=rs1:61020 failed on local exception: java.io.IOException: org.apache.hbase.thirdparty.io.netty.handler.codec.DecoderException: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): GSS initiate failed, try=266, retrying...{code}
>  
> While external issues like "krb ticket expiry" require operator intervention, it is not prudent to fill up the active handlers with endless retries while attempting to execute RPCs against a single affected regionserver. This eventually degrades the overall cluster state, specifically in the event of multiple regionserver restarts resulting from any planned activities.
> One of the resolutions here would be (see the sketch after this list):
>  # Configure max retries as part of the ExecuteProceduresRequest (or it could be part of RemoteProcedureRequest)
>  # Have RSProcedureDispatcher use this retry count while scheduling failed requests for further retries
>  # After exhausting the retries, mark the remote call as failed and bubble the failure up to the parent procedure.
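> A minimal sketch of that bounded-retry idea (illustrative only, not existing HBase code; maxAttempts stands in for whatever limit the request would carry, and the failure path is assumed):
> {code:java}
> // Hypothetical variant of scheduleForRetry() with a bounded attempt count.
> private boolean scheduleForRetry(IOException e) {
>   if (numberOfAttemptsSoFar >= maxAttempts) { // limit carried on the request (assumed)
>     LOG.error("Request to {} failed after {} attempts, giving up", serverName,
>       numberOfAttemptsSoFar, e);
>     remoteCallFailed(procedureEnv, e); // illustrative: surface the failure to the parent procedure
>     return false; // caller treats this as a permanent failure
>   }
>   numberOfAttemptsSoFar++;
>   submitTask(this,
>     Math.min(rsRpcRetryInterval * (this.numberOfAttemptsSoFar * this.numberOfAttemptsSoFar),
>       10 * 1000),
>     TimeUnit.MILLISECONDS);
>   return true;
> }
> {code}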
> If the series of calls mentioned above ends up aborting the active master, we should clearly log a FATAL/ERROR message with the underlying root cause (e.g. the GSS initiate failure in this case). That can help the operator either fix the krb ticket expiry or abort the regionserver, letting an SCP perform the heavy task of WAL splitting and recovery. However, logging alone would not prevent other procedures, as well as active handlers, from getting stuck executing remote calls without any conditional termination.


