[
https://issues.apache.org/jira/browse/HBASE-26955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17553649#comment-17553649
]
Duo Zhang commented on HBASE-26955:
-----------------------------------
In RemoteProcedureResultReporter, we reset the tries because we have
successfully done an RPC call.
And I do not fully understand the problem here: do you want to reset the sleep
time between retries when we get some special types of exceptions?
> Improvement of the pause time between retries in Rpc caller
> -----------------------------------------------------------
>
> Key: HBASE-26955
> URL: https://issues.apache.org/jira/browse/HBASE-26955
> Project: HBase
> Issue Type: Improvement
> Components: Client
> Affects Versions: 2.4.2
> Reporter: Haoze Wu
> Priority: Major
>
> We were using HBase version 2.4.2 and had clients submitting requests to an
> HBase cluster. We observed an unreasonable delay of 10 seconds in some
> table-create requests. We analyzed the logs and jstack results and reproduced
> this issue. The jstack of the client shows that this weird long delay comes
> from
> [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCallerImpl.java#L168]:
> {code:java}
> "main" #1 prio=5 os_prio=0 tid=0x00007f027c00b800 nid=0x530d in Object.wait()
> [0x00007f02853b1000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> - waiting on <0x00000003d52674f0> (a
> java.util.concurrent.atomic.AtomicBoolean)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:168)
> - locked <0x00000003d52674f0> (a
> java.util.concurrent.atomic.AtomicBoolean)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3018)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3010)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.createTableAsync(HBaseAdmin.java:648)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.createTableAsync(HBaseAdmin.java:4326)
> at org.apache.hadoop.hbase.client.Admin.createTable(Admin.java:317)
> at
> edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClient.createTable(HBaseGrayClient.java:63)
> at
> edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.create(HBaseGrayClientMain.java:70)
> at
> edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.run(HBaseGrayClientMain.java:157)
> at
> edu.jhu.order.mcgray.hb_2_4_2.HBaseGrayClientMain.main(HBaseGrayClientMain.java:183)
> {code}
> This wait comes from the calculation within
> [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/RpcRetryingCallerImpl.java#L96-L177],
> which can be simplified as follows:
> {code:java}
> public T callWithRetries(RetryingCallable<T> callable, int callTimeout)
>     throws IOException, RuntimeException {
>   // ...
>   for (int tries = 0;; tries++) {
>     long expectedSleep;
>     try {
>       // do the call ...
>     } catch (PreemptiveFastFailException e) {
>       throw e;
>     } catch (Throwable t) {
>       ExceptionUtil.rethrowIfInterrupt(t);
>       Throwable cause = t.getCause();
>       if (cause instanceof DoNotRetryIOException) { // line 113
>         // Fail fast
>         throw (DoNotRetryIOException) cause;
>       }
>       // translateException throws exception when should not retry: i.e. when request is bad.
>       interceptor.handleFailure(context, t);
>       t = translateException(t);
>       if (tries > startLogErrorsCnt) {
>         // log ...
>       }
>       // ...
>       // If the server is dead, we need to wait a little before retrying, to give
>       // a chance to the regions to be moved
>       // get right pause time, start by RETRY_BACKOFF[0] * pauseBase, where pauseBase might be
>       // special when encountering CallQueueTooBigException, see #HBASE-17114
>       long pauseBase = (t instanceof CallQueueTooBigException) ? pauseForCQTBE : pause; // line 151
>       expectedSleep = callable.sleep(pauseBase, tries); // line 152
>       // If, after the planned sleep, there won't be enough time left, we stop now.
>       long duration = singleCallDuration(expectedSleep);
>       if (duration > callTimeout) {
>         // throw exception ...
>       }
>     } finally {
>       interceptor.updateFailureInfo(context);
>     }
>     try {
>       if (expectedSleep > 0) {
>         synchronized (cancelled) {
>           if (cancelled.get()) return null;
>           cancelled.wait(expectedSleep); // line 168
>         }
>       }
>       if (cancelled.get()) return null;
>     } catch (InterruptedException e) {
>       throw new InterruptedIOException("Interrupted after " + tries +
>         " tries while maxAttempts=" + maxAttempts);
>     }
>   }
> } {code}
> Within the retry loop, the `cause` variable represents the reason for the
> current failed request, and it is checked to determine what we should do. For
> example, the case of DoNotRetryIOException is checked at line 113, and the
> case of CallQueueTooBigException is checked at line 151. Finally, we determine
> `expectedSleep`, the time we should wait before the next retry, and then
> enforce this wait at line 168, unless the call is cancelled.
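> For illustration, here is a minimal sketch of how that sleep grows with the
> retry count. It assumes the standard HConstants.RETRY_BACKOFF multipliers and
> ignores the small (~1%) random jitter that ConnectionUtils.getPauseTime adds;
> the class and method names are illustrative, not actual HBase code:
> {code:java}
> // Sketch only: approximates ConnectionUtils.getPauseTime without jitter.
> // The multipliers below match HConstants.RETRY_BACKOFF.
> public class BackoffSketch {
>   static final int[] RETRY_BACKOFF = { 1, 2, 3, 5, 10, 20, 40, 100, 100, 100, 100, 200, 200 };
>
>   static long pauseTime(long pause, int tries) {
>     // The retry count is capped at the last multiplier.
>     int index = Math.min(tries, RETRY_BACKOFF.length - 1);
>     return pause * RETRY_BACKOFF[index];
>   }
>
>   public static void main(String[] args) {
>     // With the default hbase.client.pause of 100 ms:
>     System.out.println(pauseTime(100, 6)); // 4000 ms
>     System.out.println(pauseTime(100, 7)); // 10000 ms
>   }
> }
> {code}
> With the default pause of 100 ms, tries=7 yields 100 * 100 = 10000 ms, which
> is consistent with the 10-second delay we observed.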
> The comments seem to imply that this retry pause is for the case of a dead
> server:
> {code:java}
> // If the server is dead, we need to wait a little before retrying, to give
> // a chance to the regions to be moved
> // get right pause time, start by RETRY_BACKOFF[0] * pauseBase, where pauseBase might be
> // special when encountering CallQueueTooBigException, see #HBASE-17114
> long pauseBase = (t instanceof CallQueueTooBigException) ? pauseForCQTBE : pause; // line 151
> expectedSleep = callable.sleep(pauseBase, tries); // line 152 {code}
> However, the client log shows that some retries are not due to a dead server:
> {code:java}
> 2022-04-10 15:44:44,146 INFO [main] client.RpcRetryingCallerImpl: Call exception, tries=6, retries=16, started=4858 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
>         at org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:2729)
>         at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1985)
>         at org.apache.hadoop.hbase.master.MasterRpcServices.createTable(MasterRpcServices.java:705)
>         at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> , details=, see https://s.apache.org/timeout
> 2022-04-10 15:44:48,181 INFO [main] client.RpcRetryingCallerImpl: Call exception, tries=7, retries=16, started=8893 ms ago, cancelled=false, msg=java.io.IOException: java.io.IOException: IOException
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:463)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)
> Caused by: java.io.UncheckedIOException: java.io.IOException: IOException
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.lambda$update$7(RegionProcedureStore.java:377)
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.runWithoutRpcCall(RegionProcedureStore.java:313)
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.update(RegionProcedureStore.java:364)
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.insert(RegionProcedureStore.java:323)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.submitProcedure(ProcedureExecutor.java:1047)
>         at org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil$NonceProcedureRunnable.submitProcedure(MasterProcedureUtil.java:107)
>         at org.apache.hadoop.hbase.master.HMaster$4.run(HMaster.java:2010)
>         at org.apache.hadoop.hbase.master.procedure.MasterProcedureUtil.submitProcedure(MasterProcedureUtil.java:134)
>         at org.apache.hadoop.hbase.master.HMaster.createTable(HMaster.java:1997)
>         at org.apache.hadoop.hbase.master.MasterRpcServices.createTable(MasterRpcServices.java:705)
>         at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:395)
>         ... 3 more
> Caused by: java.io.IOException: IOException
>         at org.apache.hadoop.hbase.procedure2.store.region.RegionProcedureStore.lambda$update$7(RegionProcedureStore.java:370)
>         ... 14 more
> , details=, see https://s.apache.org/timeout
> 2022-04-10 15:44:58,835 INFO [main] client.HBaseAdmin: Operation: CREATE, Table Name: default:t4, procId: 22 completed
> {code}
> The client has some retries due to a dead server (PleaseHoldException: Master
> is initializing), then gets an internal error when creating the table, and
> finally gets the table created successfully. We understand that the waiting
> time for each retry increases exponentially as the retry number grows,
> according to the retry backoff mechanism. However, since the comments before
> line 151 do not specify how the retry pause time should be determined, we
> suspect that retries after the server has successfully started should be
> backed off separately, without carrying over the retry count accumulated
> while waiting for server startup. To confirm this hypothesis, we found some
> RPC retry implementations in the same version (2.4.2) of HBase:
> [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RemoteProcedureResultReporter.java#L95-L111]
> and
> [https://github.com/apache/hbase/blob/rel/2.4.2/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L2453-L2475].
> For example, in RemoteProcedureResultReporter, the pause time is backed off
> only for startup-related exceptions and is reset otherwise:
> {code:java}
> while (!server.isStopped()) {
>   // ...
>   try {
>     server.reportProcedureDone(builder.build());
>     builder.clear();
>     tries = 0;
>   } catch (IOException e) {
>     boolean pause =
>       e instanceof ServerNotRunningYetException || e instanceof PleaseHoldException;
>     long pauseTime;
>     if (pause) {
>       // Do backoff else we flood the Master with requests.
>       pauseTime = ConnectionUtils.getPauseTime(INIT_PAUSE_TIME_MS, tries);
>     } else {
>       pauseTime = INIT_PAUSE_TIME_MS; // Reset.
>     }
>     LOG.info("Failed procedure report " + TextFormat.shortDebugString(request) +
>       "; retry (#" + tries + ")" +
>       (pause ? " after " + pauseTime + "ms delay (Master is coming online...)." : " immediately."),
>       e);
>     Threads.sleep(pauseTime);
>     tries++;
>   }
> } {code}
> Similarly, we suspect RpcRetryingCallerImpl should handle
> ServerNotRunningYetException and PleaseHoldException in this way. We followed
> this idea to write a fix and found that the symptom no longer exists. We
> tested this issue and the fix in versions 2.4.2, 2.4.11, and 3.0.0-alpha-2.
> We believe the issue exists in all these versions and our fix is applicable
> to them.
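> To make the idea concrete, here is a hypothetical sketch (not our actual
> patch) of a startup-aware retry counter that RpcRetryingCallerImpl could feed
> into callable.sleep(pauseBase, tries) instead of the raw tries value; the
> class and method names are illustrative:
> {code:java}
> import org.apache.hadoop.hbase.PleaseHoldException;
> import org.apache.hadoop.hbase.ipc.ServerNotRunningYetException;
>
> // Hypothetical sketch: restart the backoff sequence once startup-related
> // failures stop, mirroring the RemoteProcedureResultReporter behavior above.
> final class StartupAwareRetryCounter {
>   private boolean lastFailureWasStartup = false;
>   private int effectiveTries = 0;
>
>   /** Returns the retry count to use for backing off this failure. */
>   int nextTries(Throwable t) {
>     boolean startup =
>       t instanceof ServerNotRunningYetException || t instanceof PleaseHoldException;
>     if (!startup && lastFailureWasStartup) {
>       effectiveTries = 0; // server is up now; do not inherit the startup backoff
>     }
>     lastFailureWasStartup = startup;
>     return effectiveTries++;
>   }
> }
> {code}
> With such a counter, the IOException failure at tries=7 in the log above
> would be backed off as a first failure again, instead of inheriting the
> 10-second pause accumulated while waiting for the Master to come online.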
> P.S. In the latest version (3.0.0-alpha-2), RpcRetryingCallerImpl no longer
> exists; the issue and the fix apply to AsyncRpcRetryingCaller instead.