Hi Subbu,

Can you show the jmap output that showed that the number of zkclients
increased.

Jason, simulating this is straight forward.

- start ZK
- set up a dummy cluster
- start a dummy server/broker and controller
- create a resource
- stop ZKServer and restart it after setting jute.maxBuffer to very low
number like 10.

This will make the client go into connecting/disconnecting loop.  All the
while we can do

jmap -histo:live and see that the zkClient object keeps increasing.

I looked at the code and cant find any place where we would be creating new
zkclients. I dont think zkproperty store is needed to reproduce this issue.






On Wed, Oct 21, 2015 at 9:25 PM, Zhen Zhang <[email protected]> wrote:

> Hi Subbu,
>
> I don't think it's ZkClient leak. Each ZkClient has a ZkEventThread inside:
>
> https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkClient.java#L70
>
> The ZkEventThread has a unique id:
>
> https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkEventThread.java#L59
>
> From the log it seems the ZkEventThread remains the same "
> ZkClient-EventThread-34", so there is only one zk client.
>
> One reason I can think of is zk watch leak. This jira describes the issue:
> https://issues.apache.org/jira/browse/HELIX-527
>
> Does restart the controller resolve the issue?
>
> Thanks,
> Jason
>
> On Wed, Oct 21, 2015 at 7:09 PM, Subbu Subramaniam <
> [email protected]> wrote:
>
> > Hi,
> >
> > We are using helix-0.6.5.22 version of helix for pinot development (
> > https://github.com/linkedin/pinot)
> >
> > The pinot controller  constructs a ZkHelixPropertyStore() and adds a
> > listener to listen to changes on all nodes under the path specified.
> >
> > If the zookeeper server is restarted, then the controller tries to
> > re-connect and send the entire watch list to the server. Since there is a
> > 1M limit on the size of the message that the server can receive, the
> server
> > drops the connection.
> >
> > At this point, we see that the controller goes into a spin, creating a
> new
> > ZkClient object as the connection moves between SyncConnected and
> > Disconnected states.
> >
> > I have not tried to reproduce this problem independent of
> pinot-controller,
> > but I suspect the following steps could do it.
> >
> >
> >    - Start zk server
> >    - Start a program that calls new ZkHelixPropertyStore() that ends up
> >    setting watches on a bunch of nodes that may exceed 10k bytes (total
> of
> > all
> >    path name lengths).
> >    - Stop the zk server
> >    - Restart the zk server setting jute.maxbuffer to 10k.
> >    - The client will probably go into a spin leaking ZkClient objects.
> >
> >
> > The objects seem to be in a LinkedBlockingQueue, since jvisualvm also
> shows
> > LinkedBlockingQueue$Node increasing proportionately by the same number.
> >
> > Here is a sequence of error messages that may help:
> >
> > 2015/10/21 17:25:18.780 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Unable to read additional data from server
> sessionid
> > 0x1508cecdce10001, likely server has closed socket, closing socket
> > connection and attempting reconnect
> > 2015/10/21 17:25:18.880 INFO [ZkClient] [main-EventThread]
> > [pinot-controller] [] zookeeper state changed (Disconnected)
> > 2015/10/21 17:25:18.880 INFO [ZKHelixManager]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001,
> instance:
> > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
> > 2015/10/21 17:25:19.741 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Opening socket connection to server localhost/
> > 127.0.0.1:2181
> > 2015/10/21 17:25:19.741 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Socket connection established to localhost/
> > 127.0.0.1:2181, initiating session
> > 2015/10/21 17:25:19.743 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Session establishment complete on server localhost/
> > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout =
> 30000
> > 2015/10/21 17:25:19.743 INFO [ZkClient] [main-EventThread]
> > [pinot-controller] [] zookeeper state changed (SyncConnected)
> > 2015/10/21 17:25:19.743 INFO [ZKHelixManager]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> KeeperState:
> > SyncConnected, zookeeper:State:CONNECTED Timeout:30000
> > sessionid:0x1508cecdce10001 local:/127.0.0.1:50492
> remoteserver:localhost/
> > 127.0.0.1:2181 lastZxid:730 xid:465 sent:476 recv:476 queuedpkts:0
> > pendingresp:1 queuedevents:0
> > 2015/10/21 17:25:19.744 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Unable to read additional data from server
> sessionid
> > 0x1508cecdce10001, likely server has closed socket, closing socket
> > connection and attempting reconnect
> > 2015/10/21 17:25:19.844 INFO [ZkClient] [main-EventThread]
> > [pinot-controller] [] zookeeper state changed (Disconnected)
> > 2015/10/21 17:25:19.844 INFO [ZKHelixManager]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001,
> instance:
> > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
> > 2015/10/21 17:25:19.844 ERROR [ZKHelixManager]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > instanceName: ssubrama-ld1.linkedin.biz_8862 is flapping. disconnect it.
> >  maxDisconnectThreshold: 5 disconnects in 300000ms.
> > 2015/10/21 17:25:19.844 INFO [ZKHelixManager]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] disconnect
> > ssubrama-ld1.linkedin.biz_8862(CONTROLLER) from pinotDevDeploy
> > 2015/10/21 17:25:19.844 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> > down HelixTaskExecutor
> > 2015/10/21 17:25:19.845 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > HelixTaskExecutor
> > 2015/10/21 17:25:19.845 INFO [MessageQueueMonitor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > Unregistering ClusterStatus:
> > cluster=pinotDevDeploy,messageQueue=ssubrama-ld1.linkedin.biz_8862
> > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > exectuor for msgType: PARTICIPANT_ERROR_REPORT, pool:
> > java.util.concurrent.ThreadPoolExecutor@467b684d[Running, pool size = 0,
> > active threads = 0, queued tasks = 0, completed tasks = 0]
> > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> > down pool: java.util.concurrent.ThreadPoolExecutor@467b684d[Running,
> pool
> > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > exectuor for msgType: CONTROLLER_MSG, pool:
> > java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, pool size = 0,
> > active threads = 0, queued tasks = 0, completed tasks = 0]
> > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> > down pool: java.util.concurrent.ThreadPoolExecutor@53e800f9[Running,
> pool
> > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > exectuor for msgType: SCHEDULER_MSG, pool:
> > java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, pool size = 0,
> > active threads = 0, queued tasks = 0, completed tasks = 0]
> > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> > down pool: java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running,
> pool
> > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> > exectuor for msgType: TASK_REPLY, pool:
> > java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, pool size = 0,
> > active threads = 0, queued tasks = 0, completed tasks = 0]
> > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> > down pool: java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running,
> pool
> > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutdown
> > HelixTaskExecutor finished
> > 2015/10/21 17:25:19.847 INFO [CallbackHandler]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] 34
> > START:INVOKE /pinotDevDeploy/CONTROLLER/MESSAGES
> > listener:org.apache.helix.messaging.handling.HelixTaskExecutor
> > 2015/10/21 17:25:19.847 INFO [CallbackHandler]
> > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> > ssubrama-ld1.linkedin.biz_8862 unsubscribe child-change. path:
> > /pinotDevDeploy/CONTROLLER/MESSAGES, listener:
> > org.apache.helix.messaging.handling.HelixTaskExecutor@77bbadc
> > 2015/10/21 17:25:20.340 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Opening socket connection to server
> > localhost/0:0:0:0:0:0:0:1:2181
> > 2015/10/21 17:25:20.341 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Socket connection established to
> > localhost/0:0:0:0:0:0:0:1:2181, initiating session
> > 2015/10/21 17:25:20.342 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Session establishment complete on server
> > localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x1508cecdce10001, negotiated
> > timeout = 30000
> > 2015/10/21 17:25:20.342 INFO [ZkClient] [main-EventThread]
> > [pinot-controller] [] zookeeper state changed (SyncConnected)
> > 2015/10/21 17:25:20.343 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Unable to read additional data from server
> sessionid
> > 0x1508cecdce10001, likely server has closed socket, closing socket
> > connection and attempting reconnect
> > 2015/10/21 17:25:20.443 INFO [ZkClient] [main-EventThread]
> > [pinot-controller] [] zookeeper state changed (Disconnected)
> > 2015/10/21 17:25:20.534 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Opening socket connection to server localhost/
> > 127.0.0.1:2181
> > 2015/10/21 17:25:20.535 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Socket connection established to localhost/
> > 127.0.0.1:2181, initiating session
> > 2015/10/21 17:25:20.536 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Session establishment complete on server localhost/
> > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout =
> 30000
> > 2015/10/21 17:25:20.537 INFO [ZkClient] [main-EventThread]
> > [pinot-controller] [] zookeeper state changed (SyncConnected)
> > 2015/10/21 17:25:20.537 INFO [ClientCnxn]
> [main-SendThread(localhost:2181)]
> > [pinot-controller] [] Unable to read additional data from server
> sessionid
> > 0x1508cecdce10001, likely server has closed socket, closing socket
> > connection and attempting reconnect
> > 2015/10/21 17:25:20.637 INFO [ZkClient] [main-EventThread]
> > [pinot-controller] [] zookeeper state changed (Disconnected)
> >
> >
> > And so on.
> >
> > So, it does look like Helix is trying to shut down, and prints the
> > START:INVOKE, but it never prints the END:INVOKE for the same listener.
> >
> > Any ideas?
> >
> > -Subbu
> >
>

Reply via email to