Hi Subbu,

I don't think it's ZkClient leak. Each ZkClient has a ZkEventThread inside:
https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkClient.java#L70

The ZkEventThread has a unique id:
https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkEventThread.java#L59

>From the log it seems the ZkEventThread remains the same "
ZkClient-EventThread-34", so there is only one zk client.

One reason I can think of is zk watch leak. This jira describes the issue:
https://issues.apache.org/jira/browse/HELIX-527

Does restart the controller resolve the issue?

Thanks,
Jason

On Wed, Oct 21, 2015 at 7:09 PM, Subbu Subramaniam <
[email protected]> wrote:

> Hi,
>
> We are using helix-0.6.5.22 version of helix for pinot development (
> https://github.com/linkedin/pinot)
>
> The pinot controller  constructs a ZkHelixPropertyStore() and adds a
> listener to listen to changes on all nodes under the path specified.
>
> If the zookeeper server is restarted, then the controller tries to
> re-connect and send the entire watch list to the server. Since there is a
> 1M limit on the size of the message that the server can receive, the server
> drops the connection.
>
> At this point, we see that the controller goes into a spin, creating a new
> ZkClient object as the connection moves between SyncConnected and
> Disconnected states.
>
> I have not tried to reproduce this problem independent of pinot-controller,
> but I suspect the following steps could do it.
>
>
>    - Start zk server
>    - Start a program that calls new ZkHelixPropertyStore() that ends up
>    setting watches on a bunch of nodes that may exceed 10k bytes (total of
> all
>    path name lengths).
>    - Stop the zk server
>    - Restart the zk server setting jute.maxbuffer to 10k.
>    - The client will probably go into a spin leaking ZkClient objects.
>
>
> The objects seem to be in a LinkedBlockingQueue, since jvisualvm also shows
> LinkedBlockingQueue$Node increasing proportionately by the same number.
>
> Here is a sequence of error messages that may help:
>
> 2015/10/21 17:25:18.780 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Unable to read additional data from server sessionid
> 0x1508cecdce10001, likely server has closed socket, closing socket
> connection and attempting reconnect
> 2015/10/21 17:25:18.880 INFO [ZkClient] [main-EventThread]
> [pinot-controller] [] zookeeper state changed (Disconnected)
> 2015/10/21 17:25:18.880 INFO [ZKHelixManager]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001, instance:
> ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
> 2015/10/21 17:25:19.741 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Opening socket connection to server localhost/
> 127.0.0.1:2181
> 2015/10/21 17:25:19.741 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Socket connection established to localhost/
> 127.0.0.1:2181, initiating session
> 2015/10/21 17:25:19.743 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Session establishment complete on server localhost/
> 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout = 30000
> 2015/10/21 17:25:19.743 INFO [ZkClient] [main-EventThread]
> [pinot-controller] [] zookeeper state changed (SyncConnected)
> 2015/10/21 17:25:19.743 INFO [ZKHelixManager]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] KeeperState:
> SyncConnected, zookeeper:State:CONNECTED Timeout:30000
> sessionid:0x1508cecdce10001 local:/127.0.0.1:50492 remoteserver:localhost/
> 127.0.0.1:2181 lastZxid:730 xid:465 sent:476 recv:476 queuedpkts:0
> pendingresp:1 queuedevents:0
> 2015/10/21 17:25:19.744 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Unable to read additional data from server sessionid
> 0x1508cecdce10001, likely server has closed socket, closing socket
> connection and attempting reconnect
> 2015/10/21 17:25:19.844 INFO [ZkClient] [main-EventThread]
> [pinot-controller] [] zookeeper state changed (Disconnected)
> 2015/10/21 17:25:19.844 INFO [ZKHelixManager]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001, instance:
> ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
> 2015/10/21 17:25:19.844 ERROR [ZKHelixManager]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> instanceName: ssubrama-ld1.linkedin.biz_8862 is flapping. disconnect it.
>  maxDisconnectThreshold: 5 disconnects in 300000ms.
> 2015/10/21 17:25:19.844 INFO [ZKHelixManager]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] disconnect
> ssubrama-ld1.linkedin.biz_8862(CONTROLLER) from pinotDevDeploy
> 2015/10/21 17:25:19.844 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> down HelixTaskExecutor
> 2015/10/21 17:25:19.845 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> HelixTaskExecutor
> 2015/10/21 17:25:19.845 INFO [MessageQueueMonitor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> Unregistering ClusterStatus:
> cluster=pinotDevDeploy,messageQueue=ssubrama-ld1.linkedin.biz_8862
> 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> exectuor for msgType: PARTICIPANT_ERROR_REPORT, pool:
> java.util.concurrent.ThreadPoolExecutor@467b684d[Running, pool size = 0,
> active threads = 0, queued tasks = 0, completed tasks = 0]
> 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> down pool: java.util.concurrent.ThreadPoolExecutor@467b684d[Running, pool
> size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> exectuor for msgType: CONTROLLER_MSG, pool:
> java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, pool size = 0,
> active threads = 0, queued tasks = 0, completed tasks = 0]
> 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> down pool: java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, pool
> size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> exectuor for msgType: SCHEDULER_MSG, pool:
> java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, pool size = 0,
> active threads = 0, queued tasks = 0, completed tasks = 0]
> 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> down pool: java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, pool
> size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
> exectuor for msgType: TASK_REPLY, pool:
> java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, pool size = 0,
> active threads = 0, queued tasks = 0, completed tasks = 0]
> 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutting
> down pool: java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, pool
> size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Shutdown
> HelixTaskExecutor finished
> 2015/10/21 17:25:19.847 INFO [CallbackHandler]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] 34
> START:INVOKE /pinotDevDeploy/CONTROLLER/MESSAGES
> listener:org.apache.helix.messaging.handling.HelixTaskExecutor
> 2015/10/21 17:25:19.847 INFO [CallbackHandler]
> [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
> ssubrama-ld1.linkedin.biz_8862 unsubscribe child-change. path:
> /pinotDevDeploy/CONTROLLER/MESSAGES, listener:
> org.apache.helix.messaging.handling.HelixTaskExecutor@77bbadc
> 2015/10/21 17:25:20.340 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Opening socket connection to server
> localhost/0:0:0:0:0:0:0:1:2181
> 2015/10/21 17:25:20.341 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Socket connection established to
> localhost/0:0:0:0:0:0:0:1:2181, initiating session
> 2015/10/21 17:25:20.342 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Session establishment complete on server
> localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x1508cecdce10001, negotiated
> timeout = 30000
> 2015/10/21 17:25:20.342 INFO [ZkClient] [main-EventThread]
> [pinot-controller] [] zookeeper state changed (SyncConnected)
> 2015/10/21 17:25:20.343 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Unable to read additional data from server sessionid
> 0x1508cecdce10001, likely server has closed socket, closing socket
> connection and attempting reconnect
> 2015/10/21 17:25:20.443 INFO [ZkClient] [main-EventThread]
> [pinot-controller] [] zookeeper state changed (Disconnected)
> 2015/10/21 17:25:20.534 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Opening socket connection to server localhost/
> 127.0.0.1:2181
> 2015/10/21 17:25:20.535 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Socket connection established to localhost/
> 127.0.0.1:2181, initiating session
> 2015/10/21 17:25:20.536 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Session establishment complete on server localhost/
> 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout = 30000
> 2015/10/21 17:25:20.537 INFO [ZkClient] [main-EventThread]
> [pinot-controller] [] zookeeper state changed (SyncConnected)
> 2015/10/21 17:25:20.537 INFO [ClientCnxn] [main-SendThread(localhost:2181)]
> [pinot-controller] [] Unable to read additional data from server sessionid
> 0x1508cecdce10001, likely server has closed socket, closing socket
> connection and attempting reconnect
> 2015/10/21 17:25:20.637 INFO [ZkClient] [main-EventThread]
> [pinot-controller] [] zookeeper state changed (Disconnected)
>
>
> And so on.
>
> So, it does look like Helix is trying to shut down, and prints the
> START:INVOKE, but it never prints the END:INVOKE for the same listener.
>
> Any ideas?
>
> -Subbu
>

Reply via email to