Hi Subbu, Can you show the jmap output that showed that the number of zkclients increased.
Jason, simulating this is straight forward. - start ZK - set up a dummy cluster - start a dummy server/broker and controller - create a resource - stop ZKServer and restart it after setting jute.maxBuffer to very low number like 10. This will make the client go into connecting/disconnecting loop. All the while we can do jmap -histo:live and see that the zkClient object keeps increasing. I looked at the code and cant find any place where we would be creating new zkclients. I dont think zkproperty store is needed to reproduce this issue. On Wed, Oct 21, 2015 at 9:25 PM, Zhen Zhang <[email protected]> wrote: > Hi Subbu, > > I don't think it's ZkClient leak. Each ZkClient has a ZkEventThread inside: > > https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkClient.java#L70 > > The ZkEventThread has a unique id: > > https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkEventThread.java#L59 > > From the log it seems the ZkEventThread remains the same " > ZkClient-EventThread-34", so there is only one zk client. > > One reason I can think of is zk watch leak. This jira describes the issue: > https://issues.apache.org/jira/browse/HELIX-527 > > Does restart the controller resolve the issue? > > Thanks, > Jason > > On Wed, Oct 21, 2015 at 7:09 PM, Subbu Subramaniam < > [email protected]> wrote: > > > Hi, > > > > We are using helix-0.6.5.22 version of helix for pinot development ( > > https://github.com/linkedin/pinot) > > > > The pinot controller constructs a ZkHelixPropertyStore() and adds a > > listener to listen to changes on all nodes under the path specified. > > > > If the zookeeper server is restarted, then the controller tries to > > re-connect and send the entire watch list to the server. Since there is a > > 1M limit on the size of the message that the server can receive, the > server > > drops the connection. > > > > At this point, we see that the controller goes into a spin, creating a > new > > ZkClient object as the connection moves between SyncConnected and > > Disconnected states. > > > > I have not tried to reproduce this problem independent of > pinot-controller, > > but I suspect the following steps could do it. > > > > > > - Start zk server > > - Start a program that calls new ZkHelixPropertyStore() that ends up > > setting watches on a bunch of nodes that may exceed 10k bytes (total > of > > all > > path name lengths). > > - Stop the zk server > > - Restart the zk server setting jute.maxbuffer to 10k. > > - The client will probably go into a spin leaking ZkClient objects. > > > > > > The objects seem to be in a LinkedBlockingQueue, since jvisualvm also > shows > > LinkedBlockingQueue$Node increasing proportionately by the same number. > > > > Here is a sequence of error messages that may help: > > > > 2015/10/21 17:25:18.780 INFO [ClientCnxn] > [main-SendThread(localhost:2181)] > > [pinot-controller] [] Unable to read additional data from server > sessionid > > 0x1508cecdce10001, likely server has closed socket, closing socket > > connection and attempting reconnect > > 2015/10/21 17:25:18.880 INFO [ZkClient] [main-EventThread] > > [pinot-controller] [] zookeeper state changed (Disconnected) > > 2015/10/21 17:25:18.880 INFO [ZKHelixManager] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] > > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001, > instance: > > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER > > 2015/10/21 17:25:19.741 INFO [ClientCnxn] > [main-SendThread(localhost:2181)] > > [pinot-controller] [] Opening socket connection to server localhost/ > > 127.0.0.1:2181 > > 2015/10/21 17:25:19.741 INFO [ClientCnxn] > [main-SendThread(localhost:2181)] > > [pinot-controller] [] Socket connection established to localhost/ > > 127.0.0.1:2181, initiating session > > 2015/10/21 17:25:19.743 INFO [ClientCnxn] > [main-SendThread(localhost:2181)] > > [pinot-controller] [] Session establishment complete on server localhost/ > > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout = > 30000 > > 2015/10/21 17:25:19.743 INFO [ZkClient] [main-EventThread] > > [pinot-controller] [] zookeeper state changed (SyncConnected) > > 2015/10/21 17:25:19.743 INFO [ZKHelixManager] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] > KeeperState: > > SyncConnected, zookeeper:State:CONNECTED Timeout:30000 > > sessionid:0x1508cecdce10001 local:/127.0.0.1:50492 > remoteserver:localhost/ > > 127.0.0.1:2181 lastZxid:730 xid:465 sent:476 recv:476 queuedpkts:0 > > pendingresp:1 queuedevents:0 > > 2015/10/21 17:25:19.744 INFO [ClientCnxn] > [main-SendThread(localhost:2181)] > > [pinot-controller] [] Unable to read additional data from server > sessionid > > 0x1508cecdce10001, likely server has closed socket, closing socket > > connection and attempting reconnect > > 2015/10/21 17:25:19.844 INFO [ZkClient] [main-EventThread] > > [pinot-controller] [] zookeeper state changed (Disconnected) > > 2015/10/21 17:25:19.844 INFO [ZKHelixManager] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] > > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001, > instance: > > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER > > 2015/10/21 17:25:19.844 ERROR [ZKHelixManager] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] > > instanceName: ssubrama-ld1.linkedin.biz_8862 is flapping. disconnect it. > > maxDisconnectThreshold: 5 disconnects in 300000ms. > > 2015/10/21 17:25:19.844 INFO [ZKHelixManager] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] disconnect > > ssubrama-ld1.linkedin.biz_8862(CONTROLLER) from pinotDevDeploy > > 2015/10/21 17:25:19.844 INFO [HelixTaskExecutor] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting > > down HelixTaskExecutor > > 2015/10/21 17:25:19.845 INFO [HelixTaskExecutor] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset > > HelixTaskExecutor > > 2015/10/21 17:25:19.845 INFO [MessageQueueMonitor] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] > > Unregistering ClusterStatus: > > cluster=pinotDevDeploy,messageQueue=ssubrama-ld1.linkedin.biz_8862 > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset > > exectuor for msgType: PARTICIPANT_ERROR_REPORT, pool: > > java.util.concurrent.ThreadPoolExecutor@467b684d[Running, pool size = 0, > > active threads = 0, queued tasks = 0, completed tasks = 0] > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting > > down pool: java.util.concurrent.ThreadPoolExecutor@467b684d[Running, > pool > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset > > exectuor for msgType: CONTROLLER_MSG, pool: > > java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, pool size = 0, > > active threads = 0, queued tasks = 0, completed tasks = 0] > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting > > down pool: java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, > pool > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset > > exectuor for msgType: SCHEDULER_MSG, pool: > > java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, pool size = 0, > > active threads = 0, queued tasks = 0, completed tasks = 0] > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting > > down pool: java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, > pool > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset > > exectuor for msgType: TASK_REPLY, pool: > > java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, pool size = 0, > > active threads = 0, queued tasks = 0, completed tasks = 0] > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting > > down pool: java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, > pool > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutdown > > HelixTaskExecutor finished > > 2015/10/21 17:25:19.847 INFO [CallbackHandler] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] 34 > > START:INVOKE /pinotDevDeploy/CONTROLLER/MESSAGES > > listener:org.apache.helix.messaging.handling.HelixTaskExecutor > > 2015/10/21 17:25:19.847 INFO [CallbackHandler] > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] > > ssubrama-ld1.linkedin.biz_8862 unsubscribe child-change. path: > > /pinotDevDeploy/CONTROLLER/MESSAGES, listener: > > org.apache.helix.messaging.handling.HelixTaskExecutor@77bbadc > > 2015/10/21 17:25:20.340 INFO [ClientCnxn] > [main-SendThread(localhost:2181)] > > [pinot-controller] [] Opening socket connection to server > > localhost/0:0:0:0:0:0:0:1:2181 > > 2015/10/21 17:25:20.341 INFO [ClientCnxn] > [main-SendThread(localhost:2181)] > > [pinot-controller] [] Socket connection established to > > localhost/0:0:0:0:0:0:0:1:2181, initiating session > > 2015/10/21 17:25:20.342 INFO [ClientCnxn] > [main-SendThread(localhost:2181)] > > [pinot-controller] [] Session establishment complete on server > > localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x1508cecdce10001, negotiated > > timeout = 30000 > > 2015/10/21 17:25:20.342 INFO [ZkClient] [main-EventThread] > > [pinot-controller] [] zookeeper state changed (SyncConnected) > > 2015/10/21 17:25:20.343 INFO [ClientCnxn] > [main-SendThread(localhost:2181)] > > [pinot-controller] [] Unable to read additional data from server > sessionid > > 0x1508cecdce10001, likely server has closed socket, closing socket > > connection and attempting reconnect > > 2015/10/21 17:25:20.443 INFO [ZkClient] [main-EventThread] > > [pinot-controller] [] zookeeper state changed (Disconnected) > > 2015/10/21 17:25:20.534 INFO [ClientCnxn] > [main-SendThread(localhost:2181)] > > [pinot-controller] [] Opening socket connection to server localhost/ > > 127.0.0.1:2181 > > 2015/10/21 17:25:20.535 INFO [ClientCnxn] > [main-SendThread(localhost:2181)] > > [pinot-controller] [] Socket connection established to localhost/ > > 127.0.0.1:2181, initiating session > > 2015/10/21 17:25:20.536 INFO [ClientCnxn] > [main-SendThread(localhost:2181)] > > [pinot-controller] [] Session establishment complete on server localhost/ > > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout = > 30000 > > 2015/10/21 17:25:20.537 INFO [ZkClient] [main-EventThread] > > [pinot-controller] [] zookeeper state changed (SyncConnected) > > 2015/10/21 17:25:20.537 INFO [ClientCnxn] > [main-SendThread(localhost:2181)] > > [pinot-controller] [] Unable to read additional data from server > sessionid > > 0x1508cecdce10001, likely server has closed socket, closing socket > > connection and attempting reconnect > > 2015/10/21 17:25:20.637 INFO [ZkClient] [main-EventThread] > > [pinot-controller] [] zookeeper state changed (Disconnected) > > > > > > And so on. > > > > So, it does look like Helix is trying to shut down, and prints the > > START:INVOKE, but it never prints the END:INVOKE for the same listener. > > > > Any ideas? > > > > -Subbu > > >
