I verified locally and was able to reproduce the error.  This is not a
ZkWatch leak,  it seems a known issue (
https://issues.apache.org/jira/browse/ZOOKEEPER-706) which has been fixed
in ZK 3.5.x.  However, 3.5.x is still in alpha release, not sure whether
should we bump it in Helix.


Thanks
Lei

On Wed, Oct 21, 2015 at 9:58 PM, kishore g <[email protected]> wrote:

> Hi Subbu,
>
> Can you show the jmap output that showed that the number of zkclients
> increased.
>
> Jason, simulating this is straight forward.
>
> - start ZK
> - set up a dummy cluster
> - start a dummy server/broker and controller
> - create a resource
> - stop ZKServer and restart it after setting jute.maxBuffer to very low
> number like 10.
>
> This will make the client go into connecting/disconnecting loop.  All the
> while we can do
>
> jmap -histo:live and see that the zkClient object keeps increasing.
>
> I looked at the code and cant find any place where we would be creating new
> zkclients. I dont think zkproperty store is needed to reproduce this issue.
>
>
>
>
>
>
> On Wed, Oct 21, 2015 at 9:25 PM, Zhen Zhang <[email protected]> wrote:
>
> > Hi Subbu,
> >
> > I don't think it's ZkClient leak. Each ZkClient has a ZkEventThread
> inside:
> >
> > https://github.com/sgroschupf/zkclient/blob/master/src/main/
> java/org/I0Itec/zkclient/ZkClient.java#L70
> >
> > The ZkEventThread has a unique id:
> >
> > https://github.com/sgroschupf/zkclient/blob/master/src/main/
> java/org/I0Itec/zkclient/ZkEventThread.java#L59
> >
> > From the log it seems the ZkEventThread remains the same "
> > ZkClient-EventThread-34", so there is only one zk client.
> >
> > One reason I can think of is zk watch leak. This jira describes the
> issue:
> > https://issues.apache.org/jira/browse/HELIX-527
> >
> > Does restart the controller resolve the issue?
> >
> > Thanks,
> > Jason
> >
> > On Wed, Oct 21, 2015 at 7:09 PM, Subbu Subramaniam <
> > [email protected]> wrote:
> >
> > > Hi,
> > >
> > > We are using helix-0.6.5.22 version of helix for pinot development (
> > > https://github.com/linkedin/pinot)
> > >
> > > The pinot controller  constructs a ZkHelixPropertyStore() and adds a
> > > listener to listen to changes on all nodes under the path specified.
> > >
> > > If the zookeeper server is restarted, then the controller tries to
> > > re-connect and send the entire watch list to the server. Since there
> is a
> > > 1M limit on the size of the message that the server can receive, the
> > server
> > > drops the connection.
> > >
> > > At this point, we see that the controller goes into a spin, creating a
> > new
> > > ZkClient object as the connection moves between SyncConnected and
> > > Disconnected states.
> > >
> > > I have not tried to reproduce this problem independent of
> > pinot-controller,
> > > but I suspect the following steps could do it.
> > >
> > >
> > >    - Start zk server
> > >    - Start a program that calls new ZkHelixPropertyStore() that ends up
> > >    setting watches on a bunch of nodes that may exceed 10k bytes (total
> > of
> > > all
> > >    path name lengths).
> > >    - Stop the zk server
> > >    - Restart the zk server setting jute.maxbuffer to 10k.
> > >    - The client will probably go into a spin leaking ZkClient objects.
> > >
> > >
> > > The objects seem to be in a LinkedBlockingQueue, since jvisualvm also
> > shows
> > > LinkedBlockingQueue$Node increasing proportionately by the same number.
> > >
> > > Here is a sequence of error messages that may help:
> > >
> > > 2015/10/21 17:25:18.780 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Unable to read additional data from server
> > sessionid
> > > 0x1508cecdce10001, likely server has closed socket, closing socket
> > > connection and attempting reconnect
> > > 2015/10/21 17:25:18.880 INFO [ZkClient] [main-EventThread]
> > > [pinot-controller] [] zookeeper state changed (Disconnected)
> > > 2015/10/21 17:25:18.880 INFO [ZKHelixManager]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001,
> > instance:
> > > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
> > > 2015/10/21 17:25:19.741 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Opening socket connection to server localhost/
> > > 127.0.0.1:2181
> > > 2015/10/21 17:25:19.741 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Socket connection established to localhost/
> > > 127.0.0.1:2181, initiating session
> > > 2015/10/21 17:25:19.743 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Session establishment complete on server
> localhost/
> > > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout =
> > 30000
> > > 2015/10/21 17:25:19.743 INFO [ZkClient] [main-EventThread]
> > > [pinot-controller] [] zookeeper state changed (SyncConnected)
> > > 2015/10/21 17:25:19.743 INFO [ZKHelixManager]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > KeeperState:
> > > SyncConnected, zookeeper:State:CONNECTED Timeout:30000
> > > sessionid:0x1508cecdce10001 local:/127.0.0.1:50492
> > remoteserver:localhost/
> > > 127.0.0.1:2181 lastZxid:730 xid:465 sent:476 recv:476 queuedpkts:0
> > > pendingresp:1 queuedevents:0
> > > 2015/10/21 17:25:19.744 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Unable to read additional data from server
> > sessionid
> > > 0x1508cecdce10001, likely server has closed socket, closing socket
> > > connection and attempting reconnect
> > > 2015/10/21 17:25:19.844 INFO [ZkClient] [main-EventThread]
> > > [pinot-controller] [] zookeeper state changed (Disconnected)
> > > 2015/10/21 17:25:19.844 INFO [ZKHelixManager]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001,
> > instance:
> > > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
> > > 2015/10/21 17:25:19.844 ERROR [ZKHelixManager]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > > instanceName: ssubrama-ld1.linkedin.biz_8862 is flapping. disconnect
> it.
> > >  maxDisconnectThreshold: 5 disconnects in 300000ms.
> > > 2015/10/21 17:25:19.844 INFO [ZKHelixManager]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> disconnect
> > > ssubrama-ld1.linkedin.biz_8862(CONTROLLER) from pinotDevDeploy
> > > 2015/10/21 17:25:19.844 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> Shutting
> > > down HelixTaskExecutor
> > > 2015/10/21 17:25:19.845 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > > HelixTaskExecutor
> > > 2015/10/21 17:25:19.845 INFO [MessageQueueMonitor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > > Unregistering ClusterStatus:
> > > cluster=pinotDevDeploy,messageQueue=ssubrama-ld1.linkedin.biz_8862
> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > > exectuor for msgType: PARTICIPANT_ERROR_REPORT, pool:
> > > java.util.concurrent.ThreadPoolExecutor@467b684d[Running, pool size =
> 0,
> > > active threads = 0, queued tasks = 0, completed tasks = 0]
> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> Shutting
> > > down pool: java.util.concurrent.ThreadPoolExecutor@467b684d[Running,
> > pool
> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > > exectuor for msgType: CONTROLLER_MSG, pool:
> > > java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, pool size =
> 0,
> > > active threads = 0, queued tasks = 0, completed tasks = 0]
> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> Shutting
> > > down pool: java.util.concurrent.ThreadPoolExecutor@53e800f9[Running,
> > pool
> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > > exectuor for msgType: SCHEDULER_MSG, pool:
> > > java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, pool size =
> 0,
> > > active threads = 0, queued tasks = 0, completed tasks = 0]
> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> Shutting
> > > down pool: java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running,
> > pool
> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > > exectuor for msgType: TASK_REPLY, pool:
> > > java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, pool size =
> 0,
> > > active threads = 0, queued tasks = 0, completed tasks = 0]
> > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> Shutting
> > > down pool: java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running,
> > pool
> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> Shutdown
> > > HelixTaskExecutor finished
> > > 2015/10/21 17:25:19.847 INFO [CallbackHandler]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] 34
> > > START:INVOKE /pinotDevDeploy/CONTROLLER/MESSAGES
> > > listener:org.apache.helix.messaging.handling.HelixTaskExecutor
> > > 2015/10/21 17:25:19.847 INFO [CallbackHandler]
> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > > ssubrama-ld1.linkedin.biz_8862 unsubscribe child-change. path:
> > > /pinotDevDeploy/CONTROLLER/MESSAGES, listener:
> > > org.apache.helix.messaging.handling.HelixTaskExecutor@77bbadc
> > > 2015/10/21 17:25:20.340 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Opening socket connection to server
> > > localhost/0:0:0:0:0:0:0:1:2181
> > > 2015/10/21 17:25:20.341 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Socket connection established to
> > > localhost/0:0:0:0:0:0:0:1:2181, initiating session
> > > 2015/10/21 17:25:20.342 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Session establishment complete on server
> > > localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x1508cecdce10001,
> negotiated
> > > timeout = 30000
> > > 2015/10/21 17:25:20.342 INFO [ZkClient] [main-EventThread]
> > > [pinot-controller] [] zookeeper state changed (SyncConnected)
> > > 2015/10/21 17:25:20.343 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Unable to read additional data from server
> > sessionid
> > > 0x1508cecdce10001, likely server has closed socket, closing socket
> > > connection and attempting reconnect
> > > 2015/10/21 17:25:20.443 INFO [ZkClient] [main-EventThread]
> > > [pinot-controller] [] zookeeper state changed (Disconnected)
> > > 2015/10/21 17:25:20.534 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Opening socket connection to server localhost/
> > > 127.0.0.1:2181
> > > 2015/10/21 17:25:20.535 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Socket connection established to localhost/
> > > 127.0.0.1:2181, initiating session
> > > 2015/10/21 17:25:20.536 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Session establishment complete on server
> localhost/
> > > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout =
> > 30000
> > > 2015/10/21 17:25:20.537 INFO [ZkClient] [main-EventThread]
> > > [pinot-controller] [] zookeeper state changed (SyncConnected)
> > > 2015/10/21 17:25:20.537 INFO [ClientCnxn]
> > [main-SendThread(localhost:2181)]
> > > [pinot-controller] [] Unable to read additional data from server
> > sessionid
> > > 0x1508cecdce10001, likely server has closed socket, closing socket
> > > connection and attempting reconnect
> > > 2015/10/21 17:25:20.637 INFO [ZkClient] [main-EventThread]
> > > [pinot-controller] [] zookeeper state changed (Disconnected)
> > >
> > >
> > > And so on.
> > >
> > > So, it does look like Helix is trying to shut down, and prints the
> > > START:INVOKE, but it never prints the END:INVOKE for the same listener.
> > >
> > > Any ideas?
> > >
> > > -Subbu
> > >
> >
>



-- 

*Lei Xia *Senior Software Engineer
Data Infra/Nuage & Helix
LinkedIn

[email protected]
www.linkedin.com/in/lxia1

Reply via email to