[
https://issues.apache.org/jira/browse/HBASE-28428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17867058#comment-17867058
]
Hudson commented on HBASE-28428:
--------------------------------
Results for branch master
[build #1124 on
builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/1124/]:
(x) *{color:red}-1 overall{color}*
----
details (if available):
(/) {color:green}+1 general checks{color}
-- For more information [see general
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/1124/General_20Nightly_20Build_20Report/]
(x) {color:red}-1 jdk17 hadoop3 checks{color}
-- For more information [see jdk17
report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/1124/JDK17_20Nightly_20Build_20Report_20_28Hadoop3_29/]
(/) {color:green}+1 source release artifact{color}
-- See build output for details.
(x) {color:red}-1 client integration test{color}
-- Something went wrong with this stage, [check relevant console
output|https://ci-hbase.apache.org/job/HBase%20Nightly/job/master/1124//console].
> Zookeeper ConnectionRegistry APIs should have timeout
> -----------------------------------------------------
>
> Key: HBASE-28428
> URL: https://issues.apache.org/jira/browse/HBASE-28428
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 2.4.17, 3.0.0-beta-1, 2.5.8
> Reporter: Viraj Jasani
> Assignee: Divneet Kaur
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.0.0-beta-2
>
>
> Came across a couple of instances where an active master failover happened
> around the same time as a Zookeeper leader failover, leaving the HBase client
> stuck because one of its threads was blocked on a ConnectionRegistry rpc call.
> ConnectionRegistry APIs are wrapped with CompletableFuture; however, their
> usages do not apply any timeout, which can leave the entire client stuck
> indefinitely because some global locks are held while waiting. For instance,
> _getKeepAliveMasterService()_ takes {_}masterLock{_}, so if reading the active
> master from _masterAddressZNode_ gets stuck, every admin operation that needs
> {_}getKeepAliveMasterService(){_} is blocked.
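>
> A minimal sketch of that kind of bounded wait, assuming a hypothetical helper
> name and timeout value: it only illustrates how an unbounded
> _CompletableFuture.get()_ held under {_}masterLock{_} could be bounded so a
> registry stall surfaces as an exception instead of parking the thread (and the
> lock) forever.
> {code:java}
> import java.io.IOException;
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.ExecutionException;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.TimeoutException;
>
> public final class BoundedRegistryGet {
>   // Illustrative helper only: bound the wait so a ZooKeeper/registry stall is
>   // reported as an IOException instead of blocking the caller (and any lock it
>   // holds, e.g. masterLock) indefinitely.
>   static <T> T getWithTimeout(CompletableFuture<T> future, long timeoutMs) throws IOException {
>     try {
>       return future.get(timeoutMs, TimeUnit.MILLISECONDS);
>     } catch (TimeoutException e) {
>       throw new IOException("ConnectionRegistry call timed out after " + timeoutMs + " ms", e);
>     } catch (InterruptedException e) {
>       Thread.currentThread().interrupt();
>       throw new IOException("Interrupted while waiting on ConnectionRegistry call", e);
>     } catch (ExecutionException e) {
>       throw new IOException("ConnectionRegistry call failed", e.getCause());
>     }
>   }
> }
> {code}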
>
> Sample stacktrace from a blocked thread that stalled all client operations
> needing a table descriptor from Admin:
> {code:java}
> jdk.internal.misc.Unsafe.park
> java.util.concurrent.locks.LockSupport.park
> java.util.concurrent.CompletableFuture$Signaller.block
> java.util.concurrent.ForkJoinPool.managedBlock
> java.util.concurrent.CompletableFuture.waitingGet
> java.util.concurrent.CompletableFuture.get
> org.apache.hadoop.hbase.client.ConnectionImplementation.get
> org.apache.hadoop.hbase.client.ConnectionImplementation.access$?
> org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStubNoRetries
> org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStub
> org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService
> org.apache.hadoop.hbase.client.ConnectionImplementation.getMaster
> org.apache.hadoop.hbase.client.MasterCallable.prepare
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable
> org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor
> org.apache.hadoop.hbase.client.HTable.getDescriptor
> org.apache.phoenix.query.ConnectionQueryServicesImpl.getTableDescriptor
> org.apache.phoenix.query.DelegateConnectionQueryServices.getTableDescriptor
> org.apache.phoenix.util.IndexUtil.isGlobalIndexCheckerEnabled
> org.apache.phoenix.execute.MutationState.filterIndexCheckerMutations
> org.apache.phoenix.execute.MutationState.sendBatch
> org.apache.phoenix.execute.MutationState.send
> org.apache.phoenix.execute.MutationState.send
> org.apache.phoenix.execute.MutationState.commit
> org.apache.phoenix.jdbc.PhoenixConnection$?.call
> org.apache.phoenix.jdbc.PhoenixConnection$?.call
> org.apache.phoenix.call.CallRunner.run
> org.apache.phoenix.jdbc.PhoenixConnection.commit {code}
> Another similar incident is captured in PHOENIX-7233. In that case, retrieving
> the clusterId from its ZNode got stuck, which blocked the client from creating
> any more HBase Connections. Stacktrace for reference:
> {code:java}
> jdk.internal.misc.Unsafe.park
> java.util.concurrent.locks.LockSupport.park
> java.util.concurrent.CompletableFuture$Signaller.block
> java.util.concurrent.ForkJoinPool.managedBlock
> java.util.concurrent.CompletableFuture.waitingGet
> java.util.concurrent.CompletableFuture.get
> org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
> org.apache.hadoop.hbase.client.ConnectionImplementation.<init>
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
> jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
> java.lang.reflect.Constructor.newInstance
> org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
> org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
> java.security.AccessController.doPrivileged
> javax.security.auth.Subject.doAs
> org.apache.hadoop.security.UserGroupInformation.doAs
> org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
> org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
> org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
> org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
> org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
> org.apache.phoenix.util.PhoenixContextExecutor.call
> org.apache.phoenix.query.ConnectionQueryServicesImpl.init
> org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices
> org.apache.phoenix.jdbc.HighAvailabilityGroup.connectToOneCluster
> org.apache.phoenix.jdbc.ParallelPhoenixConnection.getConnection
> org.apache.phoenix.jdbc.ParallelPhoenixConnection.lambda$new$?
> org.apache.phoenix.jdbc.ParallelPhoenixConnection$$Lambda$?.get
> org.apache.phoenix.jdbc.ParallelPhoenixContext.lambda$chainOnConnClusterContext$?
> org.apache.phoenix.jdbc.ParallelPhoenixContext$$Lambda$?.apply {code}
> We should provide a configurable timeout for all ConnectionRegistry APIs.
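>
> A sketch of one possible shape for such a configurable timeout, assuming a
> hypothetical configuration key and default rather than an existing HBase
> setting: it bounds a registry future with JDK 9+ _CompletableFuture.orTimeout_;
> the key name and mechanism in the eventual patch may differ.
> {code:java}
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.TimeUnit;
> import org.apache.hadoop.conf.Configuration;
>
> public final class RegistryTimeoutSketch {
>   // Hypothetical key and default, for illustration only.
>   static final String REGISTRY_CALL_TIMEOUT_KEY = "hbase.client.registry.call.timeout.ms";
>   static final long DEFAULT_REGISTRY_CALL_TIMEOUT_MS = 10_000L;
>
>   // Wrap a ConnectionRegistry future so that a stalled ZooKeeper read
>   // (clusterId, masterAddressZNode, ...) completes exceptionally with a
>   // TimeoutException instead of hanging forever.
>   static <T> CompletableFuture<T> withRegistryTimeout(Configuration conf, CompletableFuture<T> future) {
>     long timeoutMs = conf.getLong(REGISTRY_CALL_TIMEOUT_KEY, DEFAULT_REGISTRY_CALL_TIMEOUT_MS);
>     return future.orTimeout(timeoutMs, TimeUnit.MILLISECONDS);
>   }
> }
> {code}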