Actually, the fix has been back-ported to 3.4.9 release. I did verify it in my local environment and the session reestablishment issue is gone after bump ZK client version in Helix controller. I will send a pull request to bump the zk version.
Thanks Lei On Tue, Sep 6, 2016 at 10:32 AM, Lei Xia <[email protected]> wrote: > I verified locally and was able to reproduce the error. This is not a > ZkWatch leak, it seems a known issue (https://issues.apache.org/ > jira/browse/ZOOKEEPER-706) which has been fixed in ZK 3.5.x. However, > 3.5.x is still in alpha release, not sure whether should we bump it in > Helix. > > > Thanks > Lei > > On Wed, Oct 21, 2015 at 9:58 PM, kishore g <[email protected]> wrote: > >> Hi Subbu, >> >> Can you show the jmap output that showed that the number of zkclients >> increased. >> >> Jason, simulating this is straight forward. >> >> - start ZK >> - set up a dummy cluster >> - start a dummy server/broker and controller >> - create a resource >> - stop ZKServer and restart it after setting jute.maxBuffer to very low >> number like 10. >> >> This will make the client go into connecting/disconnecting loop. All the >> while we can do >> >> jmap -histo:live and see that the zkClient object keeps increasing. >> >> I looked at the code and cant find any place where we would be creating >> new >> zkclients. I dont think zkproperty store is needed to reproduce this >> issue. >> >> >> >> >> >> >> On Wed, Oct 21, 2015 at 9:25 PM, Zhen Zhang <[email protected]> wrote: >> >> > Hi Subbu, >> > >> > I don't think it's ZkClient leak. Each ZkClient has a ZkEventThread >> inside: >> > >> > https://github.com/sgroschupf/zkclient/blob/master/src/main/ >> java/org/I0Itec/zkclient/ZkClient.java#L70 >> > >> > The ZkEventThread has a unique id: >> > >> > https://github.com/sgroschupf/zkclient/blob/master/src/main/ >> java/org/I0Itec/zkclient/ZkEventThread.java#L59 >> > >> > From the log it seems the ZkEventThread remains the same " >> > ZkClient-EventThread-34", so there is only one zk client. >> > >> > One reason I can think of is zk watch leak. This jira describes the >> issue: >> > https://issues.apache.org/jira/browse/HELIX-527 >> > >> > Does restart the controller resolve the issue? >> > >> > Thanks, >> > Jason >> > >> > On Wed, Oct 21, 2015 at 7:09 PM, Subbu Subramaniam < >> > [email protected]> wrote: >> > >> > > Hi, >> > > >> > > We are using helix-0.6.5.22 version of helix for pinot development ( >> > > https://github.com/linkedin/pinot) >> > > >> > > The pinot controller constructs a ZkHelixPropertyStore() and adds a >> > > listener to listen to changes on all nodes under the path specified. >> > > >> > > If the zookeeper server is restarted, then the controller tries to >> > > re-connect and send the entire watch list to the server. Since there >> is a >> > > 1M limit on the size of the message that the server can receive, the >> > server >> > > drops the connection. >> > > >> > > At this point, we see that the controller goes into a spin, creating a >> > new >> > > ZkClient object as the connection moves between SyncConnected and >> > > Disconnected states. >> > > >> > > I have not tried to reproduce this problem independent of >> > pinot-controller, >> > > but I suspect the following steps could do it. >> > > >> > > >> > > - Start zk server >> > > - Start a program that calls new ZkHelixPropertyStore() that ends >> up >> > > setting watches on a bunch of nodes that may exceed 10k bytes >> (total >> > of >> > > all >> > > path name lengths). >> > > - Stop the zk server >> > > - Restart the zk server setting jute.maxbuffer to 10k. >> > > - The client will probably go into a spin leaking ZkClient objects. >> > > >> > > >> > > The objects seem to be in a LinkedBlockingQueue, since jvisualvm also >> > shows >> > > LinkedBlockingQueue$Node increasing proportionately by the same >> number. >> > > >> > > Here is a sequence of error messages that may help: >> > > >> > > 2015/10/21 17:25:18.780 INFO [ClientCnxn] >> > [main-SendThread(localhost:2181)] >> > > [pinot-controller] [] Unable to read additional data from server >> > sessionid >> > > 0x1508cecdce10001, likely server has closed socket, closing socket >> > > connection and attempting reconnect >> > > 2015/10/21 17:25:18.880 INFO [ZkClient] [main-EventThread] >> > > [pinot-controller] [] zookeeper state changed (Disconnected) >> > > 2015/10/21 17:25:18.880 INFO [ZKHelixManager] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] >> > > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001, >> > instance: >> > > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER >> > > 2015/10/21 17:25:19.741 INFO [ClientCnxn] >> > [main-SendThread(localhost:2181)] >> > > [pinot-controller] [] Opening socket connection to server localhost/ >> > > 127.0.0.1:2181 >> > > 2015/10/21 17:25:19.741 INFO [ClientCnxn] >> > [main-SendThread(localhost:2181)] >> > > [pinot-controller] [] Socket connection established to localhost/ >> > > 127.0.0.1:2181, initiating session >> > > 2015/10/21 17:25:19.743 INFO [ClientCnxn] >> > [main-SendThread(localhost:2181)] >> > > [pinot-controller] [] Session establishment complete on server >> localhost/ >> > > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout = >> > 30000 >> > > 2015/10/21 17:25:19.743 INFO [ZkClient] [main-EventThread] >> > > [pinot-controller] [] zookeeper state changed (SyncConnected) >> > > 2015/10/21 17:25:19.743 INFO [ZKHelixManager] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] >> > KeeperState: >> > > SyncConnected, zookeeper:State:CONNECTED Timeout:30000 >> > > sessionid:0x1508cecdce10001 local:/127.0.0.1:50492 >> > remoteserver:localhost/ >> > > 127.0.0.1:2181 lastZxid:730 xid:465 sent:476 recv:476 queuedpkts:0 >> > > pendingresp:1 queuedevents:0 >> > > 2015/10/21 17:25:19.744 INFO [ClientCnxn] >> > [main-SendThread(localhost:2181)] >> > > [pinot-controller] [] Unable to read additional data from server >> > sessionid >> > > 0x1508cecdce10001, likely server has closed socket, closing socket >> > > connection and attempting reconnect >> > > 2015/10/21 17:25:19.844 INFO [ZkClient] [main-EventThread] >> > > [pinot-controller] [] zookeeper state changed (Disconnected) >> > > 2015/10/21 17:25:19.844 INFO [ZKHelixManager] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] >> > > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001, >> > instance: >> > > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER >> > > 2015/10/21 17:25:19.844 ERROR [ZKHelixManager] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] >> > > instanceName: ssubrama-ld1.linkedin.biz_8862 is flapping. disconnect >> it. >> > > maxDisconnectThreshold: 5 disconnects in 300000ms. >> > > 2015/10/21 17:25:19.844 INFO [ZKHelixManager] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] >> disconnect >> > > ssubrama-ld1.linkedin.biz_8862(CONTROLLER) from pinotDevDeploy >> > > 2015/10/21 17:25:19.844 INFO [HelixTaskExecutor] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] >> Shutting >> > > down HelixTaskExecutor >> > > 2015/10/21 17:25:19.845 INFO [HelixTaskExecutor] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset >> > > HelixTaskExecutor >> > > 2015/10/21 17:25:19.845 INFO [MessageQueueMonitor] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] >> > > Unregistering ClusterStatus: >> > > cluster=pinotDevDeploy,messageQueue=ssubrama-ld1.linkedin.biz_8862 >> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset >> > > exectuor for msgType: PARTICIPANT_ERROR_REPORT, pool: >> > > java.util.concurrent.ThreadPoolExecutor@467b684d[Running, pool size >> = 0, >> > > active threads = 0, queued tasks = 0, completed tasks = 0] >> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] >> Shutting >> > > down pool: java.util.concurrent.ThreadPoolExecutor@467b684d[Running, >> > pool >> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] >> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset >> > > exectuor for msgType: CONTROLLER_MSG, pool: >> > > java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, pool size >> = 0, >> > > active threads = 0, queued tasks = 0, completed tasks = 0] >> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] >> Shutting >> > > down pool: java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, >> > pool >> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] >> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset >> > > exectuor for msgType: SCHEDULER_MSG, pool: >> > > java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, pool size >> = 0, >> > > active threads = 0, queued tasks = 0, completed tasks = 0] >> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] >> Shutting >> > > down pool: java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, >> > pool >> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] >> > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset >> > > exectuor for msgType: TASK_REPLY, pool: >> > > java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, pool size >> = 0, >> > > active threads = 0, queued tasks = 0, completed tasks = 0] >> > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] >> Shutting >> > > down pool: java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, >> > pool >> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] >> > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] >> Shutdown >> > > HelixTaskExecutor finished >> > > 2015/10/21 17:25:19.847 INFO [CallbackHandler] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] 34 >> > > START:INVOKE /pinotDevDeploy/CONTROLLER/MESSAGES >> > > listener:org.apache.helix.messaging.handling.HelixTaskExecutor >> > > 2015/10/21 17:25:19.847 INFO [CallbackHandler] >> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] >> > > ssubrama-ld1.linkedin.biz_8862 unsubscribe child-change. path: >> > > /pinotDevDeploy/CONTROLLER/MESSAGES, listener: >> > > org.apache.helix.messaging.handling.HelixTaskExecutor@77bbadc >> > > 2015/10/21 17:25:20.340 INFO [ClientCnxn] >> > [main-SendThread(localhost:2181)] >> > > [pinot-controller] [] Opening socket connection to server >> > > localhost/0:0:0:0:0:0:0:1:2181 >> > > 2015/10/21 17:25:20.341 INFO [ClientCnxn] >> > [main-SendThread(localhost:2181)] >> > > [pinot-controller] [] Socket connection established to >> > > localhost/0:0:0:0:0:0:0:1:2181, initiating session >> > > 2015/10/21 17:25:20.342 INFO [ClientCnxn] >> > [main-SendThread(localhost:2181)] >> > > [pinot-controller] [] Session establishment complete on server >> > > localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x1508cecdce10001, >> negotiated >> > > timeout = 30000 >> > > 2015/10/21 17:25:20.342 INFO [ZkClient] [main-EventThread] >> > > [pinot-controller] [] zookeeper state changed (SyncConnected) >> > > 2015/10/21 17:25:20.343 INFO [ClientCnxn] >> > [main-SendThread(localhost:2181)] >> > > [pinot-controller] [] Unable to read additional data from server >> > sessionid >> > > 0x1508cecdce10001, likely server has closed socket, closing socket >> > > connection and attempting reconnect >> > > 2015/10/21 17:25:20.443 INFO [ZkClient] [main-EventThread] >> > > [pinot-controller] [] zookeeper state changed (Disconnected) >> > > 2015/10/21 17:25:20.534 INFO [ClientCnxn] >> > [main-SendThread(localhost:2181)] >> > > [pinot-controller] [] Opening socket connection to server localhost/ >> > > 127.0.0.1:2181 >> > > 2015/10/21 17:25:20.535 INFO [ClientCnxn] >> > [main-SendThread(localhost:2181)] >> > > [pinot-controller] [] Socket connection established to localhost/ >> > > 127.0.0.1:2181, initiating session >> > > 2015/10/21 17:25:20.536 INFO [ClientCnxn] >> > [main-SendThread(localhost:2181)] >> > > [pinot-controller] [] Session establishment complete on server >> localhost/ >> > > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout = >> > 30000 >> > > 2015/10/21 17:25:20.537 INFO [ZkClient] [main-EventThread] >> > > [pinot-controller] [] zookeeper state changed (SyncConnected) >> > > 2015/10/21 17:25:20.537 INFO [ClientCnxn] >> > [main-SendThread(localhost:2181)] >> > > [pinot-controller] [] Unable to read additional data from server >> > sessionid >> > > 0x1508cecdce10001, likely server has closed socket, closing socket >> > > connection and attempting reconnect >> > > 2015/10/21 17:25:20.637 INFO [ZkClient] [main-EventThread] >> > > [pinot-controller] [] zookeeper state changed (Disconnected) >> > > >> > > >> > > And so on. >> > > >> > > So, it does look like Helix is trying to shut down, and prints the >> > > START:INVOKE, but it never prints the END:INVOKE for the same >> listener. >> > > >> > > Any ideas? >> > > >> > > -Subbu >> > > >> > >> > > > > -- > > *Lei Xia *Senior Software Engineer > Data Infra/Nuage & Helix > LinkedIn > > [email protected] > www.linkedin.com/in/lxia1 > -- *Lei Xia *Senior Software Engineer Data Infra/Nuage & Helix LinkedIn [email protected] www.linkedin.com/in/lxia1
