Daniel Wong created ZOOKEEPER-4236:
--------------------------------------

             Summary: Java Client SendThread create many unnecessary Login 
objects
                 Key: ZOOKEEPER-4236
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4236
             Project: ZooKeeper
          Issue Type: Bug
            Reporter: Daniel Wong


Hi, I am an Apache Phoenix committer, and I help manage many ZooKeeper 
clusters at my employer, primarily using ZK for HBase use cases.  We recently 
had a production incident in which some of our ACLs were not set up, which 
prevented connectivity from the client to the ZK nodes, and the failure path 
exposed 2 issues to fix: this Jira and ZOOKEEPER-4235.  This Jira is the less 
important of the 2 and concerns the large number of Login objects created.  
We had hundreds of threads per JVM with the following stack trace.
{code:java}
java.lang.Thread.State: RUNNABLE
    at java.net.PlainSocketImpl.socketConnect(java.base@11.0.4.0.101/Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(java.base@11.0.4.0.101/AbstractPlainSocketImpl.java:399)
    - locked <0x00000015004fde20> (a java.net.SocksSocketImpl)
    at java.net.AbstractPlainSocketImpl.connectToAddress(java.base@11.0.4.0.101/AbstractPlainSocketImpl.java:242)
    at java.net.AbstractPlainSocketImpl.connect(java.base@11.0.4.0.101/AbstractPlainSocketImpl.java:224)
    at java.net.SocksSocketImpl.connect(java.base@11.0.4.0.101/SocksSocketImpl.java:403)
    at java.net.Socket.connect(java.base@11.0.4.0.101/Socket.java:609)
    at sun.security.krb5.internal.TCPClient.<init>(java.security.jgss@11.0.4.0.101/NetClient.java:62)
    at sun.security.krb5.internal.NetClient.getInstance(java.security.jgss@11.0.4.0.101/NetClient.java:42)
    at sun.security.krb5.KdcComm$KdcCommunication.run(java.security.jgss@11.0.4.0.101/KdcComm.java:401)
    at sun.security.krb5.KdcComm$KdcCommunication.run(java.security.jgss@11.0.4.0.101/KdcComm.java:364)
    at java.security.AccessController.doPrivileged(java.base@11.0.4.0.101/Native Method)
    at sun.security.krb5.KdcComm.send(java.security.jgss@11.0.4.0.101/KdcComm.java:348)
    at sun.security.krb5.KdcComm.sendIfPossible(java.security.jgss@11.0.4.0.101/KdcComm.java:253)
    at sun.security.krb5.KdcComm.send(java.security.jgss@11.0.4.0.101/KdcComm.java:234)
    at sun.security.krb5.KdcComm.send(java.security.jgss@11.0.4.0.101/KdcComm.java:200)
    at sun.security.krb5.KrbAsReqBuilder.send(java.security.jgss@11.0.4.0.101/KrbAsReqBuilder.java:326)
    at sun.security.krb5.KrbAsReqBuilder.action(java.security.jgss@11.0.4.0.101/KrbAsReqBuilder.java:371)
    at com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(jdk.security.auth@11.0.4.0.101/Krb5LoginModule.java:754)
    at com.sun.security.auth.module.Krb5LoginModule.login(jdk.security.auth@11.0.4.0.101/Krb5LoginModule.java:592)
    at javax.security.auth.login.LoginContext.invoke(java.base@11.0.4.0.101/LoginContext.java:726)
    at javax.security.auth.login.LoginContext$4.run(java.base@11.0.4.0.101/LoginContext.java:665)
    at javax.security.auth.login.LoginContext$4.run(java.base@11.0.4.0.101/LoginContext.java:663)
    at java.security.AccessController.doPrivileged(java.base@11.0.4.0.101/Native Method)
    at javax.security.auth.login.LoginContext.invokePriv(java.base@11.0.4.0.101/LoginContext.java:663)
    at javax.security.auth.login.LoginContext.login(java.base@11.0.4.0.101/LoginContext.java:574)
    at org.apache.zookeeper.Login.login(Login.java:304)
    - locked <0x000000151c477148> (a org.apache.zookeeper.Login)
    at org.apache.zookeeper.Login.<init>(Login.java:106)
    at org.apache.zookeeper.client.ZooKeeperSaslClient.createSaslClient(ZooKeeperSaslClient.java:249)
    - locked <0x000000151c476f68> (a org.apache.zookeeper.client.ZooKeeperSaslClient)
    at org.apache.zookeeper.client.ZooKeeperSaslClient.<init>(ZooKeeperSaslClient.java:141)
    at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:972)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1031)
{code}
Note that these threads were logging in to our 10 ZK nodes, but we had hundreds 
of Logins.  In theory we should need at most 10 Logins.

This Jira is intended to improve this behavior by limiting the number of Login 
objects/clients to the number actually needed.  Note that a combination of JIRAs 
https://issues.apache.org/jira/browse/ZOOKEEPER-2375 and 
https://issues.apache.org/jira/browse/ZOOKEEPER-2139 removed the singleton at 
the Login level but left unnecessary synchronization code in place.  This could 
be improved again via either a singleton, perhaps at the SaslClient layer, or 
some sort of connection -> login cache, so that new connections would reuse/wait 
for the same objects in failure paths.
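
To make the connection -> login cache idea concrete, here is a minimal sketch 
of one possible shape.  It is not ZooKeeper code: the class name LoginCache, 
the String key (for example a server address or JAAS context name) and the 
factory function are all hypothetical, and the login type is left generic so 
the sketch compiles without ZooKeeper on the classpath.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/**
 * Hypothetical sketch (not part of ZooKeeper): keeps at most one login
 * object per key so that many connection threads share it instead of each
 * constructing their own.
 */
final class LoginCache<L> {
    // Key is an assumption: e.g. a server address or JAAS context name.
    private final ConcurrentMap<String, L> logins = new ConcurrentHashMap<>();

    // Assumed factory that performs the (potentially slow) login for a key.
    private final Function<String, L> loginFactory;

    LoginCache(Function<String, L> loginFactory) {
        this.loginFactory = loginFactory;
    }

    /**
     * Returns the cached login for the key. If none exists, the factory is
     * invoked at most once for that key; concurrent callers asking for the
     * same key wait for that invocation rather than starting their own.
     */
    L getOrCreate(String key) {
        return logins.computeIfAbsent(key, loginFactory);
    }

    /** Number of logins currently cached. */
    int size() {
        return logins.size();
    }
}
```

With something like this, hundreds of connection threads calling 
getOrCreate with the same 10 keys would end up with 10 cached objects.  Two 
caveats on the sketch: if the factory throws, computeIfAbsent records no 
mapping, so a failing login is retried by the next caller and a real fix would 
still need a policy for the failure path; and ConcurrentHashMap may block other 
updates to the map while the factory runs, which matters when the login is a 
slow network call.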



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
