[
https://issues.apache.org/jira/browse/ZOOKEEPER-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585712#comment-13585712
]
Peter Nerg commented on ZOOKEEPER-1618:
---------------------------------------
Hi,
I'm completely with you when it comes to what ZK should handle and what is up
to the application.
ZK has no means to verify the correctness of the data model, only the
application knows that.
Our application will tolerate/manage temporary disconnections and perform
recovery as expected, so that's not the issue.
What I'm objecting to is the asymmetry in behavior depending on which ZK
node you kill.
If you kill a follower, the ZK client API will seamlessly migrate the session to
any of the surviving ZK nodes. But if you kill the leader, all connected
applications will get a disconnect event, followed by a connect event as soon as
the new leader has been elected (typically 4-6 seconds with 3 ZK nodes).
This asymmetry is my issue: why must we get a disconnect event when we elect a
new leader?
I can't find this behavior documented anywhere and it came as a surprise during
upgrade testing of our systems.
We upgrade under full traffic, which means that as soon as the ZK leader goes
down we get hiccups for a few seconds. We had to implement queuing of requests
for a short period while waiting for the connection to be established again.
Since the application has no means of determining the reason behind the
disconnect event, we have to keep that queuing period short so as not to
exhaust memory.
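A minimal sketch of the kind of short-period queuing described above, assuming
the application's watcher calls onDisconnected() and onConnected() when it sees
the Disconnected and SyncConnected states. The class name DisconnectBuffer, its
methods and the capacity bound are hypothetical and only illustrate the idea;
this is not our actual implementation.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

/**
 * Hypothetical helper: buffers requests while the ZK session is disconnected.
 * The capacity bound keeps a long outage from exhausting memory.
 */
final class DisconnectBuffer<T> {
    private final Queue<T> pending = new ArrayDeque<>();
    private final int capacity;
    private final Consumer<T> sender;   // forwards a request towards ZooKeeper
    private boolean connected = true;

    DisconnectBuffer(int capacity, Consumer<T> sender) {
        this.capacity = capacity;
        this.sender = sender;
    }

    /** Call when the watcher sees the Disconnected state. */
    synchronized void onDisconnected() {
        connected = false;
    }

    /** Call when the watcher sees SyncConnected; flushes in arrival order. */
    synchronized void onConnected() {
        connected = true;
        T request;
        while ((request = pending.poll()) != null) {
            sender.accept(request);
        }
    }

    /**
     * Sends immediately when connected, queues otherwise.
     * Returns false when the buffer is full and the request was rejected.
     */
    synchronized boolean submit(T request) {
        if (connected) {
            sender.accept(request);
            return true;
        }
        if (pending.size() >= capacity) {
            return false;   // caller must handle the rejected request
        }
        pending.add(request);
        return true;
    }
}
```

With a bound like this the application still has to decide what to do with
rejected requests, which is exactly the kind of handling we would prefer not to
need for a plain leader election.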
So the entire background to this issue report is to clarify whether this is
expected behavior or not.
If it is expected, then I expect to see clear documentation stating so. As of
now the documentation claims that killing a ZK node will basically be handled
under the hood by the ZK client API. In my opinion that is a half-truth, as the
application is notified, making it believe it lost its connection to the ZK
ensemble.
> Disconnected event when stopping leader process
> -----------------------------------------------
>
> Key: ZOOKEEPER-1618
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1618
> Project: ZooKeeper
> Issue Type: Bug
> Affects Versions: 3.4.4, 3.4.5
> Environment: Linux SLES
> java version "1.6.0_31"
> Reporter: Peter Nerg
> Priority: Minor
>
> Running a three-node ZK cluster I stop/kill the leader node.
> Immediately all connected clients will receive a Disconnected event; a second
> or so later an event with SyncConnected is received.
> Killing a follower will not produce the same issue/event.
> The application/clients have been implemented to manage Disconnected events
> so they survive.
> I however expected the ZK client to manage the hiccup during the election
> process.
> This produces quite a lot of logging in large clusters that have many
> services relying on ZK.
> In some cases we may lose a few requests as we need a working ZK cluster to
> execute those requests.
> IMHO it's not really full high availability if the ZK cluster momentarily
> takes a dive because the leader goes away.
> No matter how much redundancy one uses in form of ZK instances one still may
> get processing errors during leader election.
> I've verified this behavior in both 3.4.4 and 3.4.5