[
https://issues.apache.org/jira/browse/HBASE-28663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Manning updated HBASE-28663:
----------------------------------
Status: Patch Available (was: In Progress)
> CanaryTool continues executing and scanning after timeout
> ---------------------------------------------------------
>
> Key: HBASE-28663
> URL: https://issues.apache.org/jira/browse/HBASE-28663
> Project: HBase
> Issue Type: Bug
> Components: canary
> Affects Versions: 2.0.0, 3.0.0
> Reporter: David Manning
> Assignee: David Manning
> Priority: Minor
> Labels: pull-request-available
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> If you run the {{CanaryTool}} in region mode until it reaches the configured
> timeout, the logs and sink results will show that it can continue executing
> and scanning for 10 seconds.
> This is because the RegionTasks have already been submitted to an
> ExecutorService which continues execution after timeout, and the Monitor
> continues execution on a separate thread.
> The 10 second delay in shutdown is seen, in hbase 2.x at least, because
> {{runMonitor}} will close the {{Connection}} and that process
> ([code|https://github.com/apache/hbase/blob/e865c852c0e9a1e9b55b9d1512d379072d3e7a7b/hbase-server/src/main/java/org/apache/hadoop/hbase/tool/CanaryTool.java#L1054-L1094])
> will lead to {{ConnectionImplementation#close}}
> ([code|https://github.com/apache/hbase/blob/e865c852c0e9a1e9b55b9d1512d379072d3e7a7b/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ConnectionImplementation.java#L2272-L2300])
> and inside {{shutdownPools}} we will potentially wait the full 10 seconds of
> {{awaitTermination}} if client operations are in progress.
> The scenario can be improved by simply interrupting the monitor thread, as we
> will often be in an {{invokeAll}} call in a {{sniff}} method. The
> {{invokeAll}} method is blocking, and interrupting the monitor in this call
> will interrupt the client threads and generally shutdown properly and timely.
> However, we can be more robust by also watching for a shutdown signal in the
> various tasks such as {{RegionTask}} so any remaining tasks will drain
> quickly and without errors.
>
> {code:java}
> 2024-06-12 02:57:14 [Time-limited test] ERROR tool.Canary(1076): The monitor
> is running too long (1140098) after timeout limit:1140000 will be killed
> itself !!
> 2024-06-12 02:57:14 [Time-limited test] INFO
> client.ConnectionImplementation(2039): Closing master protocol: MasterService
> 2024-06-12 02:57:14 [pool-3-thread-4] ERROR tool.Canary(353): Read from
> REGION1. on serverName=REGIONSERVER-1, columnFamily=0 failed
> java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException:
> Task
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@54f2a9a4
> rejected from java.util.concurrent.ThreadPoolExecutor@2d3d204d[Shutting
> down, pool size = 7, active threads = 7, queued tasks = 0, completed tasks =
> 180094]
> at
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:199)
> at
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:271)
> at
> org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:440)
> at
> org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:314)
> at
> org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:612)
> at
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.readColumnFamily(CanaryTool.java:565)
> at
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.read(CanaryTool.java:609)
> at
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:503)
> at
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:471)
> [... repeats for 10 seconds and tens of thousands of regions ... ]
> 2024-06-12 02:57:16 [pool-3-thread-11] ERROR tool.Canary(353): Read from
> REGION10000. on serverName=REGIONSERVER-2, columnFamily=0 failed
> java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException:
> Task
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@d08d21f
> rejected from java.util.concurrent.ThreadPoolExecutor@2d3d204d[Shutting
> down, pool size = 3, active threads = 3, queued tasks = 0, completed tasks =
> 180098]
> [...]
> 2024-06-12 02:57:24 [pool-3-thread-11] ERROR tool.Canary(353): Read from
> REGION42000. on serverName=REGIONSERVER-3, columnFamily=0 failed
> java.lang.RuntimeException: java.util.concurrent.RejectedExecutionException:
> Task
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture@38e7a5a1
> rejected from java.util.concurrent.ThreadPoolExecutor@2d3d204d[Terminated,
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 180101]
> 2024-06-12T02:57:24.202Z, java.io.InterruptedIOException
> at
> org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:294)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:255)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:53)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:191)
> at
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:271)
> at
> org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:440)
> at
> org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:314)
> at
> org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:612)
> at
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.readColumnFamily(CanaryTool.java:565)
> at
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.read(CanaryTool.java:609)
> at
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:503)
> at
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:471)
> 2024-06-12 02:57:24 [pool-3-thread-8] WARN
> client.ConnectionImplementation(428): Checking master connection
> org.apache.hadoop.hbase.ipc.StoppedRpcClientException: Call to
> address=HMASTER-1 failed on local exception:
> org.apache.hadoop.hbase.ipc.StoppedRpcClientException at
> sun.reflect.GeneratedConstructorAccessor27.newInstance(Unknown Source) at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at
> org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:206) at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:92)
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:425)
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:420)
> at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:114) at
> org.apache.hadoop.hbase.ipc.Call.setException(Call.java:129) at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callMethod(AbstractRpcClient.java:446)
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:332)
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$200(AbstractRpcClient.java:92)
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:595)
> at
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java)
> at
> org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceState.isMasterRunning(ConnectionImplementation.java:1235)
> at
> org.apache.hadoop.hbase.client.ConnectionImplementation.lambda$masterConnectionStateSupplier$1(ConnectionImplementation.java:421)
> at
> org.apache.hadoop.hbase.client.ConnectionImplementation.isKeepAliveMasterConnectedAndRunning(ConnectionImplementation.java:2025)
> at
> org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService(ConnectionImplementation.java:1441)
> at
> org.apache.hadoop.hbase.client.ConnectionImplementation.getMaster(ConnectionImplementation.java:1433)
> at
> org.apache.hadoop.hbase.client.MasterCallable.prepare(MasterCallable.java:57)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:101)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3020)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:579)
> at org.apache.hadoop.hbase.client.HTable.getDescriptor(HTable.java:242) at
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.read(CanaryTool.java:593)
> at
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:503)
> at
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:471)
> 2024-06-12 02:57:24 [pool-3-thread-7] ERROR tool.Canary(345): Read from
> REGION43000. on serverName=REGIONSERVER-4 failed
> org.apache.hadoop.hbase.DoNotRetryIOException: Connection was closed while
> trying to get master
> at
> org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStub(ConnectionImplementation.java:1381)
> at
> org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService(ConnectionImplementation.java:1445)
> at
> org.apache.hadoop.hbase.client.ConnectionImplementation.getMaster(ConnectionImplementation.java:1433)
> at
> org.apache.hadoop.hbase.client.MasterCallable.prepare(MasterCallable.java:57)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:101)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:3020)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:579)
> at org.apache.hadoop.hbase.client.HTable.getDescriptor(HTable.java:242)
> at
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.read(CanaryTool.java:593)
> at
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:503)
> at
> org.apache.hadoop.hbase.tool.CanaryTool$RegionTask.call(CanaryTool.java:471){code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)