Viraj Jasani created PHOENIX-7233:
-------------------------------------

             Summary: CQSI openConnection should timeout to unblock other connection threads
                 Key: PHOENIX-7233
                 URL: https://issues.apache.org/jira/browse/PHOENIX-7233
             Project: Phoenix
          Issue Type: Improvement
    Affects Versions: 5.1.3
            Reporter: Viraj Jasani

PhoenixDriver initializes ConnectionQueryServices objects and caches them in connectionQueryServicesCache. As part of CQSI initialization, a connection to the HBase server is opened using the ConnectionFactory provided by the HBase client, which returns a Connection object to the client. The Connection object lets clients share the ZooKeeper connection, the meta cache, and the remote connections to the regionservers and master daemons. It is used for Table CRUD operations as well as administrative actions on the cluster.

Initializing the HBase Connection object requires the ClusterId, which is maintained in ZooKeeper, in the Master daemons, or in both. The client retrieves it from one or the other depending on whether it is configured to use ZKConnectionRegistry or MasterRegistry/RpcConnectionRegistry.

With ZKConnectionRegistry, we have run into an edge case in which the connection to the ZooKeeper server was stuck for more than 12 hours. While the client was creating a connection to the ZooKeeper quorum to retrieve the ClusterId, the ZooKeeper leader switched from one server to another. Why the leader switch resulted in a stuck connection still requires root cause analysis (RCA), but in any case the Phoenix/HBase client should not wait indefinitely for a response from ZooKeeper with no connection timeout.

For the Phoenix client, if one thread is stuck opening a connection during CQSI#init, every other thread trying to create a connection also gets stuck, because we take a class-level lock before opening the connection. This can lead to degradation or termination of the client JVM.

The HBase client should also use a timeout, but the lack of a timeout on the Phoenix client side has far worse consequences. As part of this Jira, we should introduce a way for CQSI#openConnection to time out, either by using the CompletableFuture API or by using our preconfigured thread pool.

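As a rough illustration of the CompletableFuture option mentioned above, the sketch below runs a blocking "open" call on an executor and waits for it with a bound. It is only a sketch and not the proposed patch: the class `TimedOpen`, the method `openWithTimeout`, and the `timeoutMs` parameter are hypothetical names, and the `Callable` stands in for the real call to `ConnectionFactory.createConnection`.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Phoenix or HBase.
final class TimedOpen {

    private TimedOpen() {
    }

    /**
     * Runs the blocking opener on the given executor and waits at most
     * timeoutMs milliseconds for its result.
     */
    static <T> T openWithTimeout(Callable<T> opener, ExecutorService executor,
                                 long timeoutMs) throws IOException {
        CompletableFuture<T> future = CompletableFuture.supplyAsync(() -> {
            try {
                return opener.call();
            } catch (Exception e) {
                // supplyAsync takes a Supplier, so checked exceptions are wrapped.
                throw new CompletionException(e);
            }
        }, executor);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Completes the future exceptionally. CompletableFuture.cancel does
            // not interrupt the worker, so the stuck task may keep its thread.
            future.cancel(true);
            throw new IOException(
                "Timed out after " + timeoutMs + " ms while opening connection", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while opening connection", e);
        } catch (ExecutionException e) {
            throw new IOException("Failed to open connection", e.getCause());
        }
    }
}
```

With this shape, the thread holding the class-level lock gets an exception after the timeout and can release the lock, so other connection threads are unblocked. The worker thread that is actually stuck is not freed by the timeout alone, which is one reason the choice of executor (common pool versus a dedicated, bounded thread pool) matters.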
Stacktrace for reference:

{code:java}
jdk.internal.misc.Unsafe.park
java.util.concurrent.locks.LockSupport.park
java.util.concurrent.CompletableFuture$Signaller.block
java.util.concurrent.ForkJoinPool.managedBlock
java.util.concurrent.CompletableFuture.waitingGet
java.util.concurrent.CompletableFuture.get
org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
org.apache.hadoop.hbase.client.ConnectionImplementation.<init>
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
java.lang.reflect.Constructor.newInstance
org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
java.security.AccessController.doPrivileged
javax.security.auth.Subject.doAs
org.apache.hadoop.security.UserGroupInformation.doAs
org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.util.PhoenixContextExecutor.call
org.apache.phoenix.query.ConnectionQueryServicesImpl.init
org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices
org.apache.phoenix.jdbc.HighAvailabilityGroup.connectToOneCluster
org.apache.phoenix.jdbc.ParallelPhoenixConnection.getConnection
org.apache.phoenix.jdbc.ParallelPhoenixConnection.lambda$new$?
org.apache.phoenix.jdbc.ParallelPhoenixConnection$$Lambda$?.get
org.apache.phoenix.jdbc.ParallelPhoenixContext.lambda$chainOnConnClusterContext$?
org.apache.phoenix.jdbc.ParallelPhoenixContext$$Lambda$?.apply
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)