Viraj Jasani created PHOENIX-7233:
-------------------------------------

             Summary: CQSI openConnection should timeout to unblock other 
connection threads
                 Key: PHOENIX-7233
                 URL: https://issues.apache.org/jira/browse/PHOENIX-7233
             Project: Phoenix
          Issue Type: Improvement
    Affects Versions: 5.1.3
            Reporter: Viraj Jasani


PhoenixDriver initializes ConnectionQueryServices objects and caches them in 
connectionQueryServicesCache. As part of CQSI initialization, a connection to 
the HBase cluster is opened through the ConnectionFactory provided by the HBase 
client, which returns a Connection object. The Connection object provided by 
HBase lets clients share the Zookeeper connection, the meta cache, and the 
remote connections to regionservers and master daemons. It is used to perform 
Table CRUD operations as well as administrative actions on the cluster.

HBase Connection object initialization requires the ClusterId, which is 
maintained in Zookeeper, in the Master daemons, or in both. The client 
retrieves it from one or the other depending on whether it is configured to 
use ZKConnectionRegistry or MasterRegistry/RpcConnectionRegistry.

For ZKConnectionRegistry, we have run into an edge case in which the 
connection to the Zookeeper server was stuck for more than 12 hours. When the 
client tried to connect to the Zookeeper quorum to retrieve the ClusterId, the 
Zookeeper leader was switched from one server to another. Why the leader 
switch resulted in a stuck connection still requires root cause analysis 
(RCA), but in any case the Phoenix/HBase client should not wait indefinitely 
for a response from Zookeeper without any connection timeout.

For the Phoenix client, if one thread is stuck opening the connection during 
CQSI#init, all other threads trying to create connections also get stuck, 
because a class-level lock is taken before the connection is opened. This can 
lead to degradation or termination of the client JVM.

While the HBase client should also use a timeout, the lack of a timeout on the 
Phoenix client side has far worse consequences. As part of this Jira, we 
should introduce a way for CQSI#openConnection to time out, either by using 
the CompletableFuture API or by using our preconfigured thread pool.
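
As a rough illustration of the thread-pool option, the sketch below runs a 
blocking "open" call on an executor and bounds the wait with Future#get. It is 
not Phoenix code: the class OpenConnectionTimeoutSketch, the helper 
callWithTimeout, the pool, and the timeout values are all hypothetical names 
chosen for this example, and a string stands in for the HBase Connection.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch, not part of Phoenix or HBase.
class OpenConnectionTimeoutSketch {

    // Runs the blocking opener on the pool and waits at most timeoutMs for it.
    // On timeout the task is cancelled (its thread is interrupted) and the
    // TimeoutException is rethrown so the caller can release any lock it holds.
    static <T> T callWithTimeout(Callable<T> opener, long timeoutMs, ExecutorService pool)
            throws Exception {
        Future<T> future = pool.submit(opener);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interruption only helps if the blocked call responds to it.
            future.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            // Surface the opener's own failure rather than the wrapper.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Daemon threads so a stuck opener cannot keep the JVM alive.
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "open-connection-sketch");
            t.setDaemon(true);
            return t;
        });
        try {
            // Stand-in for a connection that opens promptly.
            String conn = callWithTimeout(() -> "fake-connection", 1000, pool);
            System.out.println("opened: " + conn);

            // Stand-in for an open call that hangs (e.g. on ClusterId retrieval).
            callWithTimeout(() -> {
                Thread.sleep(60_000);
                return "never-returned";
            }, 200, pool);
        } catch (TimeoutException e) {
            System.out.println("open timed out; caller is unblocked");
        } finally {
            pool.shutdownNow();
        }
    }
}
```

With the CompletableFuture option, a similar bound could be expressed with 
CompletableFuture#get(long, TimeUnit), or with orTimeout on Java 9 and later. 
Either way, the point is that the thread holding the lock gets control back 
after a bounded wait instead of parking indefinitely.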

 

Stack trace for reference:

 
{code:java}
jdk.internal.misc.Unsafe.park
java.util.concurrent.locks.LockSupport.park
java.util.concurrent.CompletableFuture$Signaller.block
java.util.concurrent.ForkJoinPool.managedBlock
java.util.concurrent.CompletableFuture.waitingGet
java.util.concurrent.CompletableFuture.get
org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
org.apache.hadoop.hbase.client.ConnectionImplementation.<init>
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
java.lang.reflect.Constructor.newInstance
org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
java.security.AccessController.doPrivileged
javax.security.auth.Subject.doAs
org.apache.hadoop.security.UserGroupInformation.doAs
org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.util.PhoenixContextExecutor.call
org.apache.phoenix.query.ConnectionQueryServicesImpl.init
org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices
org.apache.phoenix.jdbc.HighAvailabilityGroup.connectToOneCluster
org.apache.phoenix.jdbc.ParallelPhoenixConnection.getConnection
org.apache.phoenix.jdbc.ParallelPhoenixConnection.lambda$new$?
org.apache.phoenix.jdbc.ParallelPhoenixConnection$$Lambda$?.get
org.apache.phoenix.jdbc.ParallelPhoenixContext.lambda$chainOnConnClusterContext$?
org.apache.phoenix.jdbc.ParallelPhoenixContext$$Lambda$?.apply
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
