Hi, We are using helix-0.6.5.22 version of helix for pinot development ( https://github.com/linkedin/pinot)
The pinot controller constructs a ZkHelixPropertyStore() and adds a listener to listen to changes on all nodes under the path specified. If the zookeeper server is restarted, then the controller tries to re-connect and send the entire watch list to the server. Since there is a 1M limit on the size of the message that the server can receive, the server drops the connection. At this point, we see that the controller goes into a spin, creating a new ZkClient object as the connection moves between SyncConnected and Disconnected states. I have not tried to reproduce this problem independent of pinot-controller, but I suspect the following steps could do it. - Start zk server - Start a program that calls new ZkHelixPropertyStore() that ends up setting watches on a bunch of nodes that may exceed 10k bytes (total of all path name lengths). - Stop the zk server - Restart the zk server setting jute.maxbuffer to 10k. - The client will probably go into a spin leaking ZkClient objects. The objects seem to be in a LinkedBlockingQueue, since jvisualvm also shows LinkedBlockingQueue$Node increasing proportionately by the same number. Here is a sequence of error messages that may help: 2015/10/21 17:25:18.780 INFO [ClientCnxn] [main-SendThread(localhost:2181)] [pinot-controller] [] Unable to read additional data from server sessionid 0x1508cecdce10001, likely server has closed socket, closing socket connection and attempting reconnect 2015/10/21 17:25:18.880 INFO [ZkClient] [main-EventThread] [pinot-controller] [] zookeeper state changed (Disconnected) 2015/10/21 17:25:18.880 INFO [ZKHelixManager] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001, instance: ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER 2015/10/21 17:25:19.741 INFO [ClientCnxn] [main-SendThread(localhost:2181)] [pinot-controller] [] Opening socket connection to server localhost/ 127.0.0.1:2181 2015/10/21 17:25:19.741 INFO [ClientCnxn] [main-SendThread(localhost:2181)] [pinot-controller] [] Socket connection established to localhost/ 127.0.0.1:2181, initiating session 2015/10/21 17:25:19.743 INFO [ClientCnxn] [main-SendThread(localhost:2181)] [pinot-controller] [] Session establishment complete on server localhost/ 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout = 30000 2015/10/21 17:25:19.743 INFO [ZkClient] [main-EventThread] [pinot-controller] [] zookeeper state changed (SyncConnected) 2015/10/21 17:25:19.743 INFO [ZKHelixManager] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] KeeperState: SyncConnected, zookeeper:State:CONNECTED Timeout:30000 sessionid:0x1508cecdce10001 local:/127.0.0.1:50492 remoteserver:localhost/ 127.0.0.1:2181 lastZxid:730 xid:465 sent:476 recv:476 queuedpkts:0 pendingresp:1 queuedevents:0 2015/10/21 17:25:19.744 INFO [ClientCnxn] [main-SendThread(localhost:2181)] [pinot-controller] [] Unable to read additional data from server sessionid 0x1508cecdce10001, likely server has closed socket, closing socket connection and attempting reconnect 2015/10/21 17:25:19.844 INFO [ZkClient] [main-EventThread] [pinot-controller] [] zookeeper state changed (Disconnected) 2015/10/21 17:25:19.844 INFO [ZKHelixManager] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001, instance: ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER 2015/10/21 17:25:19.844 ERROR [ZKHelixManager] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] instanceName: ssubrama-ld1.linkedin.biz_8862 is flapping. disconnect it. maxDisconnectThreshold: 5 disconnects in 300000ms. 2015/10/21 17:25:19.844 INFO [ZKHelixManager] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] disconnect ssubrama-ld1.linkedin.biz_8862(CONTROLLER) from pinotDevDeploy 2015/10/21 17:25:19.844 INFO [HelixTaskExecutor] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting down HelixTaskExecutor 2015/10/21 17:25:19.845 INFO [HelixTaskExecutor] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset HelixTaskExecutor 2015/10/21 17:25:19.845 INFO [MessageQueueMonitor] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Unregistering ClusterStatus: cluster=pinotDevDeploy,messageQueue=ssubrama-ld1.linkedin.biz_8862 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset exectuor for msgType: PARTICIPANT_ERROR_REPORT, pool: java.util.concurrent.ThreadPoolExecutor@467b684d[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@467b684d[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset exectuor for msgType: CONTROLLER_MSG, pool: java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset exectuor for msgType: SCHEDULER_MSG, pool: java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset exectuor for msgType: TASK_REPLY, pool: java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting down pool: java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutdown HelixTaskExecutor finished 2015/10/21 17:25:19.847 INFO [CallbackHandler] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] 34 START:INVOKE /pinotDevDeploy/CONTROLLER/MESSAGES listener:org.apache.helix.messaging.handling.HelixTaskExecutor 2015/10/21 17:25:19.847 INFO [CallbackHandler] [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] ssubrama-ld1.linkedin.biz_8862 unsubscribe child-change. path: /pinotDevDeploy/CONTROLLER/MESSAGES, listener: org.apache.helix.messaging.handling.HelixTaskExecutor@77bbadc 2015/10/21 17:25:20.340 INFO [ClientCnxn] [main-SendThread(localhost:2181)] [pinot-controller] [] Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181 2015/10/21 17:25:20.341 INFO [ClientCnxn] [main-SendThread(localhost:2181)] [pinot-controller] [] Socket connection established to localhost/0:0:0:0:0:0:0:1:2181, initiating session 2015/10/21 17:25:20.342 INFO [ClientCnxn] [main-SendThread(localhost:2181)] [pinot-controller] [] Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x1508cecdce10001, negotiated timeout = 30000 2015/10/21 17:25:20.342 INFO [ZkClient] [main-EventThread] [pinot-controller] [] zookeeper state changed (SyncConnected) 2015/10/21 17:25:20.343 INFO [ClientCnxn] [main-SendThread(localhost:2181)] [pinot-controller] [] Unable to read additional data from server sessionid 0x1508cecdce10001, likely server has closed socket, closing socket connection and attempting reconnect 2015/10/21 17:25:20.443 INFO [ZkClient] [main-EventThread] [pinot-controller] [] zookeeper state changed (Disconnected) 2015/10/21 17:25:20.534 INFO [ClientCnxn] [main-SendThread(localhost:2181)] [pinot-controller] [] Opening socket connection to server localhost/ 127.0.0.1:2181 2015/10/21 17:25:20.535 INFO [ClientCnxn] [main-SendThread(localhost:2181)] [pinot-controller] [] Socket connection established to localhost/ 127.0.0.1:2181, initiating session 2015/10/21 17:25:20.536 INFO [ClientCnxn] [main-SendThread(localhost:2181)] [pinot-controller] [] Session establishment complete on server localhost/ 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout = 30000 2015/10/21 17:25:20.537 INFO [ZkClient] [main-EventThread] [pinot-controller] [] zookeeper state changed (SyncConnected) 2015/10/21 17:25:20.537 INFO [ClientCnxn] [main-SendThread(localhost:2181)] [pinot-controller] [] Unable to read additional data from server sessionid 0x1508cecdce10001, likely server has closed socket, closing socket connection and attempting reconnect 2015/10/21 17:25:20.637 INFO [ZkClient] [main-EventThread] [pinot-controller] [] zookeeper state changed (Disconnected) And so on. So, it does look like Helix is trying to shut down, and prints the START:INVOKE, but it never prints the END:INVOKE for the same listener. Any ideas? -Subbu
