[ https://issues.apache.org/jira/browse/HBASE-28428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17866485#comment-17866485 ]

Viraj Jasani commented on HBASE-28428:
--------------------------------------

We need follow-up work to introduce a similar timeout for RpcConnectionRegistry 
as well. For that, we might have to introduce separate rpc timeout configs, 
because the underlying rpc framework is the same for RpcConnectionRegistry and 
any other Admin-to-Server requests.
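
As a rough sketch only (the config key and class below are hypothetical, not existing HBase APIs), a registry-specific timeout could be read from its own key instead of reusing the generic hbase.rpc.timeout:
{code:java}
import org.apache.hadoop.conf.Configuration;

/**
 * Illustrative sketch: the config key is made up. It only shows how a
 * registry-specific rpc timeout could be kept separate from the generic
 * hbase.rpc.timeout used for Admin-to-Server requests.
 */
public class RegistryRpcTimeoutSketch {
  public static final String REGISTRY_RPC_TIMEOUT_KEY = "hbase.client.registry.rpc.timeout.ms";
  public static final int DEFAULT_REGISTRY_RPC_TIMEOUT_MS = 10_000;

  public static int getRegistryRpcTimeoutMs(Configuration conf) {
    // Falls back to the default when the (hypothetical) key is not set.
    return conf.getInt(REGISTRY_RPC_TIMEOUT_KEY, DEFAULT_REGISTRY_RPC_TIMEOUT_MS);
  }
}
{code}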

> Zookeeper ConnectionRegistry APIs should have timeout
> -----------------------------------------------------
>
>                 Key: HBASE-28428
>                 URL: https://issues.apache.org/jira/browse/HBASE-28428
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 2.4.17, 3.0.0-beta-1, 2.5.8
>            Reporter: Viraj Jasani
>            Assignee: Divneet Kaur
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0-beta-2
>
>
> Came across a couple of instances where active master failover happens around 
> the same time as ZooKeeper leader failover, leading to a stuck HBase client if 
> one of its threads is blocked on one of the ConnectionRegistry rpc calls. 
> ConnectionRegistry APIs are wrapped with CompletableFuture, but their usages 
> do not apply any timeout, which can leave the entire client stuck indefinitely 
> because we take some global locks. For instance, _getKeepAliveMasterService()_ 
> takes {_}masterLock{_}, so if retrieving the active master from 
> _masterAddressZNode_ gets stuck, we block every admin operation that needs 
> {_}getKeepAliveMasterService(){_}.
>  
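> For illustration only (the class and method below are made up, not the actual client code), the problematic pattern is roughly an unbounded CompletableFuture.get() while a global lock is held:
> {code:java}
> import java.util.concurrent.CompletableFuture;
>
> public class BlockedOnRegistrySketch {
>   // Stand-in for the global lock taken by getKeepAliveMasterService().
>   private final Object masterLock = new Object();
>
>   // Simplified view of the hang: the registry future is awaited without a timeout
>   // while masterLock is held, so if ZooKeeper never answers, every other caller
>   // that needs the master stub queues up behind this lock forever.
>   String getActiveMaster(CompletableFuture<String> masterAddressFuture) throws Exception {
>     synchronized (masterLock) {
>       return masterAddressFuture.get(); // unbounded wait
>     }
>   }
> }
> {code}
>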
> Sample stacktrace that blocked all client operations requiring a table 
> descriptor from Admin:
> {code:java}
> jdk.internal.misc.Unsafe.park
> java.util.concurrent.locks.LockSupport.park
> java.util.concurrent.CompletableFuture$Signaller.block
> java.util.concurrent.ForkJoinPool.managedBlock
> java.util.concurrent.CompletableFuture.waitingGet
> java.util.concurrent.CompletableFuture.get
> org.apache.hadoop.hbase.client.ConnectionImplementation.get
> org.apache.hadoop.hbase.client.ConnectionImplementation.access$?
> org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStubNoRetries
> org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStub
> org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService
> org.apache.hadoop.hbase.client.ConnectionImplementation.getMaster
> org.apache.hadoop.hbase.client.MasterCallable.prepare
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable
> org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor
> org.apache.hadoop.hbase.client.HTable.getDescriptor
> org.apache.phoenix.query.ConnectionQueryServicesImpl.getTableDescriptor
> org.apache.phoenix.query.DelegateConnectionQueryServices.getTableDescriptor
> org.apache.phoenix.util.IndexUtil.isGlobalIndexCheckerEnabled
> org.apache.phoenix.execute.MutationState.filterIndexCheckerMutations
> org.apache.phoenix.execute.MutationState.sendBatch
> org.apache.phoenix.execute.MutationState.send
> org.apache.phoenix.execute.MutationState.send
> org.apache.phoenix.execute.MutationState.commit
> org.apache.phoenix.jdbc.PhoenixConnection$?.call
> org.apache.phoenix.jdbc.PhoenixConnection$?.call
> org.apache.phoenix.call.CallRunner.run
> org.apache.phoenix.jdbc.PhoenixConnection.commit {code}
> Another similar incident is captured in PHOENIX-7233. In that case, 
> retrieving the clusterId from the ZNode got stuck, which blocked the client 
> from creating any more HBase Connections. Stacktrace for reference:
> {code:java}
> jdk.internal.misc.Unsafe.park
> java.util.concurrent.locks.LockSupport.park
> java.util.concurrent.CompletableFuture$Signaller.block
> java.util.concurrent.ForkJoinPool.managedBlock
> java.util.concurrent.CompletableFuture.waitingGet
> java.util.concurrent.CompletableFuture.get
> org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
> org.apache.hadoop.hbase.client.ConnectionImplementation.<init>
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
> jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
> java.lang.reflect.Constructor.newInstance
> org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
> org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
> java.security.AccessController.doPrivileged
> javax.security.auth.Subject.doAs
> org.apache.hadoop.security.UserGroupInformation.doAs
> org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
> org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
> org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
> org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
> org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
> org.apache.phoenix.util.PhoenixContextExecutor.call
> org.apache.phoenix.query.ConnectionQueryServicesImpl.init
> org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices
> org.apache.phoenix.jdbc.HighAvailabilityGroup.connectToOneCluster
> org.apache.phoenix.jdbc.ParallelPhoenixConnection.getConnection
> org.apache.phoenix.jdbc.ParallelPhoenixConnection.lambda$new$?
> org.apache.phoenix.jdbc.ParallelPhoenixConnection$$Lambda$?.get
> org.apache.phoenix.jdbc.ParallelPhoenixContext.lambda$chainOnConnClusterContext$?
> org.apache.phoenix.jdbc.ParallelPhoenixContext$$Lambda$?.apply  {code}
> We should provide a configurable timeout for all ConnectionRegistry APIs.
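>
> A minimal sketch, assuming a configurable timeout value is already resolved from the client configuration (the helper below is illustrative, not the actual implementation), of how a registry future could be bounded and a timeout surfaced to callers:
> {code:java}
> import java.io.IOException;
> import java.io.InterruptedIOException;
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.ExecutionException;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.TimeoutException;
>
> public class BoundedRegistryGetSketch {
>   // Waits at most timeoutMs for the registry call, converting a timeout into an
>   // IOException so callers fail fast instead of blocking while holding global locks.
>   static <T> T get(CompletableFuture<T> future, long timeoutMs, String operation)
>       throws IOException {
>     try {
>       return future.get(timeoutMs, TimeUnit.MILLISECONDS);
>     } catch (InterruptedException e) {
>       Thread.currentThread().interrupt();
>       InterruptedIOException iioe = new InterruptedIOException(operation + " interrupted");
>       iioe.initCause(e);
>       throw iioe;
>     } catch (ExecutionException e) {
>       throw new IOException(operation + " failed", e.getCause());
>     } catch (TimeoutException e) {
>       throw new IOException(operation + " timed out after " + timeoutMs + " ms", e);
>     }
>   }
> }
> {code}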



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
