[jira] [Commented] (ZOOKEEPER-1618) Disconnected event when stopping leader process

Peter Nerg (JIRA) Mon, 25 Feb 2013 00:18:19 -0800

    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13585712#comment-13585712
 ]


Peter Nerg commented on ZOOKEEPER-1618:
---------------------------------------

Hi,

I'm completely with you when it comes to what ZK should handle and what is up 
to the application.
ZK has no means to verify the correctness of the data model, only the 
application knows that. 
Our application will tolerate/manage temporary disconnections and perform 
recovery as expected, so that's not the issue.

What I'm objecting to is the asymmetry in behavior depending on which ZK 
node you kill.
If you kill a follower, the ZK client API seamlessly migrates the session to 
any of the surviving ZK nodes. But if you kill the leader, all connected 
applications get a disconnect event, followed by a connect event as soon as the 
new leader has been elected (typically 4-6 seconds with 3 ZK nodes).
This asymmetry is my issue: why must we get a disconnect event when a new 
leader is elected?

I can't find this behavior documented anywhere, and it came as a surprise during 
upgrade testing of our systems.
We upgrade under full traffic, which means that as soon as the ZK leader goes 
down we get hiccups for a few seconds. We had to implement queuing of requests 
for a short period while waiting for the connection to be re-established. Since 
the application has no means to figure out the reason behind the disconnect 
event, the queuing has to be limited to a short period so as not to exhaust 
memory.
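
To illustrate the kind of short-period queuing described above, here is a 
minimal sketch. It is not the reporter's actual implementation: the class name 
BoundedRequestBuffer, its methods, the String request type and the capacity 
limit are all hypothetical. It assumes the application calls onDisconnected() 
and onConnected() from its ZK Watcher when it sees the Disconnected and 
SyncConnected states respectively, and it bounds the queue by size rather than 
by time to keep the example short.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

/**
 * Hypothetical helper (not part of ZooKeeper): buffers requests while the
 * ZK connection is down and rejects new ones once the buffer is full, so a
 * long outage cannot exhaust memory.
 */
class BoundedRequestBuffer {
    private final int capacity;              // assumed limit, tune per application
    private final Consumer<String> sender;   // whatever actually executes a request
    private final Queue<String> pending = new ArrayDeque<>();
    private boolean connected = true;

    BoundedRequestBuffer(int capacity, Consumer<String> sender) {
        this.capacity = capacity;
        this.sender = sender;
    }

    /** Assumed to be called when the watcher sees a Disconnected event. */
    synchronized void onDisconnected() {
        connected = false;
    }

    /** Assumed to be called on SyncConnected: flush queued requests in order. */
    synchronized void onConnected() {
        connected = true;
        String request;
        while ((request = pending.poll()) != null) {
            sender.accept(request);
        }
    }

    /** Returns false if the request was rejected because the buffer is full. */
    synchronized boolean submit(String request) {
        if (connected) {
            sender.accept(request);
            return true;
        }
        if (pending.size() >= capacity) {
            return false;
        }
        pending.add(request);
        return true;
    }

    synchronized int pendingCount() {
        return pending.size();
    }
}
```

Because the application cannot tell a leader election from a real loss of the 
ensemble, the sketch treats every disconnect the same way and simply caps how 
much it is willing to hold.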

So the entire background to this issue report is to clarify whether this is 
expected behavior or not.
If it is expected, then I expect to see clear documentation stating so. As of 
now the documentation claims that killing a ZK node will basically be handled 
under the hood by the ZK client API. In my opinion that is a half-truth, as the 
application is notified, making it believe it has lost its connection to the ZK 
ensemble.


                
> Disconnected event when stopping leader process
> -----------------------------------------------
>
>                 Key: ZOOKEEPER-1618
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1618
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.4, 3.4.5
>         Environment: Linux SLES
> java version "1.6.0_31"
>            Reporter: Peter Nerg
>            Priority: Minor
>
> Running a three node ZK cluster I stop/kill the leader node.
> Immediately all connected clients will receive a Disconnected event, a second 
> or so later an event with SyncConnected is received.
> Killing a follower will not produce the same issue/event.
> The application/clients have been implemented to manage Disconnected events 
> so they survive.
> I however expected the ZK client to manage the hiccup during the election 
> process.
> This produces quite a lot of logging in large clusters that have many 
> services relying on ZK.
> In some cases we may lose a few requests as we need a working ZK cluster to 
> execute those requests.
> IMHO it's not really full high availability if the ZK cluster momentarily 
> takes a dive because the leader goes away.
> No matter how much redundancy one uses in form of ZK instances one still may 
> get processing errors during leader election.
> I've verified this behavior in both 3.4.4 and 3.4.5

