[jira] [Commented] (CURATOR-134) Curator sends a connection LOST event before sessionTimeout

Cameron McKenzie (JIRA) Mon, 18 Aug 2014 23:49:10 -0700

    [ 
https://issues.apache.org/jira/browse/CURATOR-134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101930#comment-14101930
 ]


Cameron McKenzie commented on CURATOR-134:
------------------------------------------

I think that I've tracked down what the problem is. It can occur when there's 
connection loss, followed by connection reestablishment followed by connection 
loss again. Something along the lines of the following occurs.

Assume a retry policy of 3 retries with a 10 second sleep between retries.

-Connected to ZK
-Connection is lost
-Start the background sync that occurs on connection loss. This initially fails 
because there's no connection, and gets put on the retry queue to occur in 10 
seconds.

Less than 10 seconds pass, during which the next two events occur:
-Connection is reestablished
-Connection is lost

After 10 seconds have passed:
-Background sync from the previous connection loss is retried. Fails again, gets 
requeued etc.

The problem is that this sync process has already used one of its configured 
retries, so if the connection does not come back before the rest of the retries 
have expired, then a LOST event is generated. This is why the LOST event is 
generated more quickly than expected. In the worst case, the sync process 
could be on its last retry, with only a small amount of time left before that 
retry occurs, when the connection is reestablished and lost again. This would 
cause the LOST event to happen essentially immediately after the RECONNECTED 
event.
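
To make the timing concrete, here is a toy model of the sequence described 
above. It is not Curator code: the class name, the method name and the way 
outages are represented are all made up for illustration, and it assumes the 
sync is attempted once immediately and then once per retry.

```java
/** Toy model of the retry timeline described above; NOT Curator code. */
final class SyncRetryModel {
    /**
     * Each outage is {fromSeconds, toSeconds}, half-open: the connection is
     * down for from <= t < to.
     */
    static boolean isDown(int t, int[][] outages) {
        for (int[] o : outages) {
            if (o[0] <= t && t < o[1]) {
                return true;
            }
        }
        return false;
    }

    /**
     * Follows a single sync that starts at startTime and is retried every
     * sleepSeconds. Returns the time at which it runs out of retries (the
     * point where LOST would be raised in this model), or -1 if one of its
     * attempts happens to find the connection up.
     */
    static int lostTime(int startTime, int maxRetries, int sleepSeconds, int[][] outages) {
        int t = startTime;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (!isDown(t, outages)) {
                return -1; // sync succeeded, no LOST from this sync
            }
            if (attempt == maxRetries) {
                return t; // initial attempt plus all retries have failed
            }
            t += sleepSeconds; // requeued for another try
        }
        return -1; // not reached
    }
}
```

With 3 retries and a 10 second sleep, a loss at t=0, a reconnect at t=4 and a 
second loss at t=6, the sync started at t=0 gives up at t=30, which is only 24 
seconds after the second loss. A sync counted from the second loss alone would 
give up at t=36.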

I'm not sure what the best way to fix this is yet. Ideally, we really want to 
cancel this sync process if a connection is reestablished, because if the 
connection is lost again, then a new sync process gets generated regardless of 
whether one is already running. I'm not sure how practical that is though, so 
I will have a bit more of a dig.
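
One possible shape for "cancel the sync on reconnect", purely as a sketch: 
tag each sync with a token and let a queued retry drop itself if the token is 
stale. The class and method names below are hypothetical and none of this is 
existing Curator API; how it would hook into the background retry queue is 
exactly the part I haven't worked out.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper, not part of Curator: invalidates stale sync retries. */
final class SyncEpochGuard {
    private final AtomicLong epoch = new AtomicLong();

    /** Call on connection loss; the new sync carries the returned token. */
    long onConnectionLost() {
        return epoch.incrementAndGet();
    }

    /** Call on reconnect; every sync started before this point becomes stale. */
    void onReconnected() {
        epoch.incrementAndGet();
    }

    /** A queued retry should only proceed if its token is still current. */
    boolean isCurrent(long token) {
        return epoch.get() == token;
    }
}
```

A retry pulled off the queue would check isCurrent(token) first and simply 
return if it is false, so only the sync created by the most recent connection 
loss can ever count down to a LOST event.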

Any thoughts [~randgalt] (or any of the other devs)?


> Curator sends a connection LOST event before sessionTimeout
> -----------------------------------------------------------
>
>                 Key: CURATOR-134
>                 URL: https://issues.apache.org/jira/browse/CURATOR-134
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 2.6.0
>         Environment: Ubuntu 12.04
>            Reporter: Benjamin Jaton
>            Priority: Critical
>         Attachments: Test.java
>
>
> Created a Curator client with:
> - connection timeout: 10 seconds
> - session timeout: 30 seconds
> - retry policy: RetryNTimes(3, 10000)
> A scenario where the ensemble is lost causes the Curator client to send 
> a LOST event in less than the expected 30 seconds:
> Fri Aug 01 11:17:19 PDT 2014 - CURATOR STATE: SUSPENDED
> Fri Aug 01 11:17:29 PDT 2014 - CURATOR STATE: LOST
> The client code is attached, this is the complete output:
> Fri Aug 01 11:16:53 PDT 2014 - CURATOR STATE: CONNECTED
> Fri Aug 01 11:16:54 PDT 2014 - Creating ZK client...
> Fri Aug 01 11:16:54 PDT 2014 - ZK client created...
> Fri Aug 01 11:16:54 PDT 2014 - ZOOKEEPER STATE: SyncConnected
> Fri Aug 01 11:16:58 PDT 2014 - ZOOKEEPER STATE: Disconnected
> Fri Aug 01 11:16:58 PDT 2014 - CURATOR STATE: SUSPENDED
> Fri Aug 01 11:17:16 PDT 2014 - CURATOR STATE: RECONNECTED
> Fri Aug 01 11:17:17 PDT 2014 - ZOOKEEPER STATE: SyncConnected
> Fri Aug 01 11:17:19 PDT 2014 - ZOOKEEPER STATE: Disconnected
> Fri Aug 01 11:17:19 PDT 2014 - CURATOR STATE: SUSPENDED
> Fri Aug 01 11:17:29 PDT 2014 - CURATOR STATE: LOST
> I think that the LOST event is actually 30 seconds away from the very first 
> SUSPENDED event, whereas it should be 30 seconds away from the last one.
> To reproduce it, I started only 2 ZK servers in a 3-node ensemble, then I 
> stopped one of them (-> 1st SUSPENDED), waited for 10-20 seconds, then 
> started it and stopped it again.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
