[ https://issues.apache.org/jira/browse/CURATOR-134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101930#comment-14101930 ]
Cameron McKenzie commented on CURATOR-134:
------------------------------------------
I think that I've tracked down what the problem is. It can occur when there's
connection loss, followed by connection reestablishment followed by connection
loss again. Something along the lines of the following occurs.
Assume a retry policy of 3 retries with a 10 second sleep between retries.
-Connected to ZK
-Connection is lost
-The background sync that occurs on connection loss is started. It initially
fails because there's no connection, and gets put on the retry queue to run in
10 seconds.
Less than 10 seconds pass and the next two events occur:
-Connection is reestablished
-Connection is lost
After the 10 seconds have passed:
-The background sync from the previous connection loss is retried. It fails
again, gets requeued, etc.
The problem is that this sync process has already used one of its configured
retries, so if the connection does not come back before the rest of the retries
have expired, then a LOST event is generated. This is why the LOST event is
generated more quickly than expected. In the worst case, the sync process
could be on its last retry, with only a small amount of time left before that
retry runs, when the connection is reestablished and then lost again. This
would cause the LOST event to happen essentially immediately after the
RECONNECTED event.
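To make the timing concrete, here is a rough model of the sequence above. It
is only a sketch, not Curator's actual code: the class and method names are
made up, the event times are hypothetical, and it assumes LOST is raised as
soon as the last retry of the original sync fails.

```java
import java.util.OptionalLong;

// Simplified model of the scenario described above; not Curator's actual code.
class StaleSyncTimeline {

    /**
     * Returns the time (in seconds) at which a background sync started at
     * firstLoss gives up and triggers LOST, or empty if one of its retries
     * lands inside the [reconnect, secondLoss) window and therefore succeeds.
     */
    static OptionalLong lostAt(long firstLoss, long reconnect, long secondLoss,
                               int retries, long sleepSeconds) {
        for (int attempt = 1; attempt <= retries; attempt++) {
            long t = firstLoss + attempt * sleepSeconds;
            boolean connected = t >= reconnect && t < secondLoss;
            if (connected) {
                return OptionalLong.empty(); // sync succeeded, no LOST from it
            }
        }
        return OptionalLong.of(firstLoss + retries * sleepSeconds);
    }

    public static void main(String[] args) {
        int retries = 3;
        long sleep = 10;
        // Hypothetical times: loss at 0s, reconnect at 4s, second loss at 6s.
        long lost = lostAt(0, 4, 6, retries, sleep).getAsLong();
        System.out.println("LOST " + (lost - 6) + "s after the second loss");
        // Worst case: the reconnect and loss happen just before the final retry.
        long worst = lostAt(0, 28, 29, retries, sleep).getAsLong();
        System.out.println("LOST " + (worst - 29) + "s after the second loss");
    }
}
```

In this model the retries stay anchored to the first connection loss, so the
second loss never gets a full set of retries of its own.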
I'm not sure what the best way to fix this is yet. Ideally, we really want to
cancel this sync process if a connection is reestablished, because if the
connection is lost again, then a new sync process gets generated regardless of
whether one is already running. I'm not sure how practical that is though, so
I will have a bit more of a dig.
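One possible shape for the cancellation, purely as an illustration: tag each
sync with a token when it starts, and have a queued retry drop itself if a
reconnect has happened since. Everything below (SyncEpochs, startSync,
onReconnected, shouldRetry) is hypothetical and not existing Curator API, and
I haven't checked how it would fit into the real background operation queue.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; these names are not part of Curator.
class SyncEpochs {
    private final AtomicLong epoch = new AtomicLong();

    /** Called on reconnect: invalidates every sync started before now. */
    void onReconnected() {
        epoch.incrementAndGet();
    }

    /** Called on connection loss: returns a token for the newly started sync. */
    long startSync() {
        return epoch.get();
    }

    /** A queued retry should run only if no reconnect happened since it started. */
    boolean shouldRetry(long token) {
        return token == epoch.get();
    }

    public static void main(String[] args) {
        SyncEpochs epochs = new SyncEpochs();
        long first = epochs.startSync();   // first connection loss
        epochs.onReconnected();            // connection comes back
        long second = epochs.startSync();  // second connection loss
        System.out.println("first sync retries: " + epochs.shouldRetry(first));
        System.out.println("second sync retries: " + epochs.shouldRetry(second));
    }
}
```

With something like this, the stale sync from the first loss would stop
consuming retries, and only the sync from the most recent loss could lead to
LOST.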
Any thoughts [~randgalt] (or any of the other devs)?
> Curator sends a connection LOST event before sessionTimeout
> -----------------------------------------------------------
>
> Key: CURATOR-134
> URL: https://issues.apache.org/jira/browse/CURATOR-134
> Project: Apache Curator
> Issue Type: Bug
> Components: Client
> Affects Versions: 2.6.0
> Environment: Ubuntu 12.04
> Reporter: Benjamin Jaton
> Priority: Critical
> Attachments: Test.java
>
>
> Created a Curator client with:
> - connection timeout: 10 seconds
> - session timeout: 30 seconds
> - retry policy: RetryNTimes(3, 10000)
> A scenario where the ensemble is lost causes the Curator client to send
> a LOST event in less than the expected 30 seconds:
> Fri Aug 01 11:17:19 PDT 2014 - CURATOR STATE: SUSPENDED
> Fri Aug 01 11:17:29 PDT 2014 - CURATOR STATE: LOST
> The client code is attached, this is the complete output:
> Fri Aug 01 11:16:53 PDT 2014 - CURATOR STATE: CONNECTED
> Fri Aug 01 11:16:54 PDT 2014 - Creating ZK client...
> Fri Aug 01 11:16:54 PDT 2014 - ZK client created...
> Fri Aug 01 11:16:54 PDT 2014 - ZOOKEEPER STATE: SyncConnected
> Fri Aug 01 11:16:58 PDT 2014 - ZOOKEEPER STATE: Disconnected
> Fri Aug 01 11:16:58 PDT 2014 - CURATOR STATE: SUSPENDED
> Fri Aug 01 11:17:16 PDT 2014 - CURATOR STATE: RECONNECTED
> Fri Aug 01 11:17:17 PDT 2014 - ZOOKEEPER STATE: SyncConnected
> Fri Aug 01 11:17:19 PDT 2014 - ZOOKEEPER STATE: Disconnected
> Fri Aug 01 11:17:19 PDT 2014 - CURATOR STATE: SUSPENDED
> Fri Aug 01 11:17:29 PDT 2014 - CURATOR STATE: LOST
> I think that the LOST event is actually 30 seconds away from the very first
> SUSPENDED event, whereas it should be 30 seconds away from the last one.
> To reproduce it, I started only 2 ZK servers in a 3-node ensemble, then I
> stopped one of them (-> 1st SUSPENDED), waited for 10-20 seconds, then
> started it and stopped it again.