Actually, the fix has been back-ported to 3.4.9 release. I did verify it in
my local environment and the session reestablishment issue is gone after
bump ZK client version in Helix controller.  I will send a pull request to
bump the zk version.


Thanks
Lei

On Tue, Sep 6, 2016 at 10:32 AM, Lei Xia <[email protected]> wrote:

> I verified locally and was able to reproduce the error.  This is not a
> ZkWatch leak,  it seems a known issue (https://issues.apache.org/
> jira/browse/ZOOKEEPER-706) which has been fixed in ZK 3.5.x.  However,
> 3.5.x is still in alpha release, not sure whether should we bump it in
> Helix.
>
>
> Thanks
> Lei
>
> On Wed, Oct 21, 2015 at 9:58 PM, kishore g <[email protected]> wrote:
>
>> Hi Subbu,
>>
>> Can you show the jmap output that showed that the number of zkclients
>> increased.
>>
>> Jason, simulating this is straight forward.
>>
>> - start ZK
>> - set up a dummy cluster
>> - start a dummy server/broker and controller
>> - create a resource
>> - stop ZKServer and restart it after setting jute.maxBuffer to very low
>> number like 10.
>>
>> This will make the client go into connecting/disconnecting loop.  All the
>> while we can do
>>
>> jmap -histo:live and see that the zkClient object keeps increasing.
>>
>> I looked at the code and cant find any place where we would be creating
>> new
>> zkclients. I dont think zkproperty store is needed to reproduce this
>> issue.
>>
>>
>>
>>
>>
>>
>> On Wed, Oct 21, 2015 at 9:25 PM, Zhen Zhang <[email protected]> wrote:
>>
>> > Hi Subbu,
>> >
>> > I don't think it's ZkClient leak. Each ZkClient has a ZkEventThread
>> inside:
>> >
>> > https://github.com/sgroschupf/zkclient/blob/master/src/main/
>> java/org/I0Itec/zkclient/ZkClient.java#L70
>> >
>> > The ZkEventThread has a unique id:
>> >
>> > https://github.com/sgroschupf/zkclient/blob/master/src/main/
>> java/org/I0Itec/zkclient/ZkEventThread.java#L59
>> >
>> > From the log it seems the ZkEventThread remains the same "
>> > ZkClient-EventThread-34", so there is only one zk client.
>> >
>> > One reason I can think of is zk watch leak. This jira describes the
>> issue:
>> > https://issues.apache.org/jira/browse/HELIX-527
>> >
>> > Does restart the controller resolve the issue?
>> >
>> > Thanks,
>> > Jason
>> >
>> > On Wed, Oct 21, 2015 at 7:09 PM, Subbu Subramaniam <
>> > [email protected]> wrote:
>> >
>> > > Hi,
>> > >
>> > > We are using helix-0.6.5.22 version of helix for pinot development (
>> > > https://github.com/linkedin/pinot)
>> > >
>> > > The pinot controller  constructs a ZkHelixPropertyStore() and adds a
>> > > listener to listen to changes on all nodes under the path specified.
>> > >
>> > > If the zookeeper server is restarted, then the controller tries to
>> > > re-connect and send the entire watch list to the server. Since there
>> is a
>> > > 1M limit on the size of the message that the server can receive, the
>> > server
>> > > drops the connection.
>> > >
>> > > At this point, we see that the controller goes into a spin, creating a
>> > new
>> > > ZkClient object as the connection moves between SyncConnected and
>> > > Disconnected states.
>> > >
>> > > I have not tried to reproduce this problem independent of
>> > pinot-controller,
>> > > but I suspect the following steps could do it.
>> > >
>> > >
>> > >    - Start zk server
>> > >    - Start a program that calls new ZkHelixPropertyStore() that ends
>> up
>> > >    setting watches on a bunch of nodes that may exceed 10k bytes
>> (total
>> > of
>> > > all
>> > >    path name lengths).
>> > >    - Stop the zk server
>> > >    - Restart the zk server setting jute.maxbuffer to 10k.
>> > >    - The client will probably go into a spin leaking ZkClient objects.
>> > >
>> > >
>> > > The objects seem to be in a LinkedBlockingQueue, since jvisualvm also
>> > shows
>> > > LinkedBlockingQueue$Node increasing proportionately by the same
>> number.
>> > >
>> > > Here is a sequence of error messages that may help:
>> > >
>> > > 2015/10/21 17:25:18.780 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Unable to read additional data from server
>> > sessionid
>> > > 0x1508cecdce10001, likely server has closed socket, closing socket
>> > > connection and attempting reconnect
>> > > 2015/10/21 17:25:18.880 INFO [ZkClient] [main-EventThread]
>> > > [pinot-controller] [] zookeeper state changed (Disconnected)
>> > > 2015/10/21 17:25:18.880 INFO [ZKHelixManager]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> > > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001,
>> > instance:
>> > > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
>> > > 2015/10/21 17:25:19.741 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Opening socket connection to server localhost/
>> > > 127.0.0.1:2181
>> > > 2015/10/21 17:25:19.741 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Socket connection established to localhost/
>> > > 127.0.0.1:2181, initiating session
>> > > 2015/10/21 17:25:19.743 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Session establishment complete on server
>> localhost/
>> > > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout =
>> > 30000
>> > > 2015/10/21 17:25:19.743 INFO [ZkClient] [main-EventThread]
>> > > [pinot-controller] [] zookeeper state changed (SyncConnected)
>> > > 2015/10/21 17:25:19.743 INFO [ZKHelixManager]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> > KeeperState:
>> > > SyncConnected, zookeeper:State:CONNECTED Timeout:30000
>> > > sessionid:0x1508cecdce10001 local:/127.0.0.1:50492
>> > remoteserver:localhost/
>> > > 127.0.0.1:2181 lastZxid:730 xid:465 sent:476 recv:476 queuedpkts:0
>> > > pendingresp:1 queuedevents:0
>> > > 2015/10/21 17:25:19.744 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Unable to read additional data from server
>> > sessionid
>> > > 0x1508cecdce10001, likely server has closed socket, closing socket
>> > > connection and attempting reconnect
>> > > 2015/10/21 17:25:19.844 INFO [ZkClient] [main-EventThread]
>> > > [pinot-controller] [] zookeeper state changed (Disconnected)
>> > > 2015/10/21 17:25:19.844 INFO [ZKHelixManager]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> > > KeeperState:Disconnected, disconnectedSessionId: 1508cecdce10001,
>> > instance:
>> > > ssubrama-ld1.linkedin.biz_8862, type: CONTROLLER
>> > > 2015/10/21 17:25:19.844 ERROR [ZKHelixManager]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> > > instanceName: ssubrama-ld1.linkedin.biz_8862 is flapping. disconnect
>> it.
>> > >  maxDisconnectThreshold: 5 disconnects in 300000ms.
>> > > 2015/10/21 17:25:19.844 INFO [ZKHelixManager]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> disconnect
>> > > ssubrama-ld1.linkedin.biz_8862(CONTROLLER) from pinotDevDeploy
>> > > 2015/10/21 17:25:19.844 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> Shutting
>> > > down HelixTaskExecutor
>> > > 2015/10/21 17:25:19.845 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
>> > > HelixTaskExecutor
>> > > 2015/10/21 17:25:19.845 INFO [MessageQueueMonitor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> > > Unregistering ClusterStatus:
>> > > cluster=pinotDevDeploy,messageQueue=ssubrama-ld1.linkedin.biz_8862
>> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
>> > > exectuor for msgType: PARTICIPANT_ERROR_REPORT, pool:
>> > > java.util.concurrent.ThreadPoolExecutor@467b684d[Running, pool size
>> = 0,
>> > > active threads = 0, queued tasks = 0, completed tasks = 0]
>> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> Shutting
>> > > down pool: java.util.concurrent.ThreadPoolExecutor@467b684d[Running,
>> > pool
>> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
>> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
>> > > exectuor for msgType: CONTROLLER_MSG, pool:
>> > > java.util.concurrent.ThreadPoolExecutor@53e800f9[Running, pool size
>> = 0,
>> > > active threads = 0, queued tasks = 0, completed tasks = 0]
>> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> Shutting
>> > > down pool: java.util.concurrent.ThreadPoolExecutor@53e800f9[Running,
>> > pool
>> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
>> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
>> > > exectuor for msgType: SCHEDULER_MSG, pool:
>> > > java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running, pool size
>> = 0,
>> > > active threads = 0, queued tasks = 0, completed tasks = 0]
>> > > 2015/10/21 17:25:19.846 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> Shutting
>> > > down pool: java.util.concurrent.ThreadPoolExecutor@2924f1d8[Running,
>> > pool
>> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
>> > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] Reset
>> > > exectuor for msgType: TASK_REPLY, pool:
>> > > java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running, pool size
>> = 0,
>> > > active threads = 0, queued tasks = 0, completed tasks = 0]
>> > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> Shutting
>> > > down pool: java.util.concurrent.ThreadPoolExecutor@1bc49bc5[Running,
>> > pool
>> > > size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
>> > > 2015/10/21 17:25:19.847 INFO [HelixTaskExecutor]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> Shutdown
>> > > HelixTaskExecutor finished
>> > > 2015/10/21 17:25:19.847 INFO [CallbackHandler]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] [] 34
>> > > START:INVOKE /pinotDevDeploy/CONTROLLER/MESSAGES
>> > > listener:org.apache.helix.messaging.handling.HelixTaskExecutor
>> > > 2015/10/21 17:25:19.847 INFO [CallbackHandler]
>> > > [ZkClient-EventThread-34-localhost:2181] [pinot-controller] []
>> > > ssubrama-ld1.linkedin.biz_8862 unsubscribe child-change. path:
>> > > /pinotDevDeploy/CONTROLLER/MESSAGES, listener:
>> > > org.apache.helix.messaging.handling.HelixTaskExecutor@77bbadc
>> > > 2015/10/21 17:25:20.340 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Opening socket connection to server
>> > > localhost/0:0:0:0:0:0:0:1:2181
>> > > 2015/10/21 17:25:20.341 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Socket connection established to
>> > > localhost/0:0:0:0:0:0:0:1:2181, initiating session
>> > > 2015/10/21 17:25:20.342 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Session establishment complete on server
>> > > localhost/0:0:0:0:0:0:0:1:2181, sessionid = 0x1508cecdce10001,
>> negotiated
>> > > timeout = 30000
>> > > 2015/10/21 17:25:20.342 INFO [ZkClient] [main-EventThread]
>> > > [pinot-controller] [] zookeeper state changed (SyncConnected)
>> > > 2015/10/21 17:25:20.343 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Unable to read additional data from server
>> > sessionid
>> > > 0x1508cecdce10001, likely server has closed socket, closing socket
>> > > connection and attempting reconnect
>> > > 2015/10/21 17:25:20.443 INFO [ZkClient] [main-EventThread]
>> > > [pinot-controller] [] zookeeper state changed (Disconnected)
>> > > 2015/10/21 17:25:20.534 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Opening socket connection to server localhost/
>> > > 127.0.0.1:2181
>> > > 2015/10/21 17:25:20.535 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Socket connection established to localhost/
>> > > 127.0.0.1:2181, initiating session
>> > > 2015/10/21 17:25:20.536 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Session establishment complete on server
>> localhost/
>> > > 127.0.0.1:2181, sessionid = 0x1508cecdce10001, negotiated timeout =
>> > 30000
>> > > 2015/10/21 17:25:20.537 INFO [ZkClient] [main-EventThread]
>> > > [pinot-controller] [] zookeeper state changed (SyncConnected)
>> > > 2015/10/21 17:25:20.537 INFO [ClientCnxn]
>> > [main-SendThread(localhost:2181)]
>> > > [pinot-controller] [] Unable to read additional data from server
>> > sessionid
>> > > 0x1508cecdce10001, likely server has closed socket, closing socket
>> > > connection and attempting reconnect
>> > > 2015/10/21 17:25:20.637 INFO [ZkClient] [main-EventThread]
>> > > [pinot-controller] [] zookeeper state changed (Disconnected)
>> > >
>> > >
>> > > And so on.
>> > >
>> > > So, it does look like Helix is trying to shut down, and prints the
>> > > START:INVOKE, but it never prints the END:INVOKE for the same
>> listener.
>> > >
>> > > Any ideas?
>> > >
>> > > -Subbu
>> > >
>> >
>>
>
>
>
> --
>
> *Lei Xia *Senior Software Engineer
> Data Infra/Nuage & Helix
> LinkedIn
>
> [email protected]
> www.linkedin.com/in/lxia1
>



-- 

*Lei Xia *Senior Software Engineer
Data Infra/Nuage & Helix
LinkedIn

[email protected]
www.linkedin.com/in/lxia1

Reply via email to