[jira] [Commented] (CURATOR-504) Race conditions in LeaderLatch after reconnecting to ensemble

Jordan Zimmerman (JIRA) Wed, 06 Feb 2019 08:17:05 -0800


    [ 
https://issues.apache.org/jira/browse/CURATOR-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761894#comment-16761894
 ]


Jordan Zimmerman commented on CURATOR-504:
------------------------------------------

[~yuri.tceretian]: I've had an idea for a while of a circuit-breaker-style 
{{ConnectionStateListener}}. It would proxy any ConnectionStateListeners used 
by Curator recipes/classes such that, when the connection is lost, the circuit 
would open for a period of time and, while open, ignore any changes in state. 
After the time period expires, the circuit would close and send whatever the 
current connection state is. This way, if the connection is going 
up/down/up/down/up/down, the application would only see the first down and 
then, N ms later, once the connection is hopefully repaired, the 
reconnection.

Thoughts?
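
The proxy described above could be sketched roughly as follows. This is a minimal illustration and not Curator code: {{State}} and {{Listener}} are local stand-ins for Curator's {{ConnectionState}} and {{ConnectionStateListener}} (the real listener also receives the client), and {{CircuitBreakingListener}}, {{closeCircuit}} and {{openMillis}} are hypothetical names. It assumes that both SUSPENDED and LOST count as "connection lost", and that on closing the circuit the current state is forwarded only if it differs from the last state the delegate saw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CircuitBreakingListenerSketch {
    // Stand-in for Curator's ConnectionState (subset, for illustration only).
    enum State { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    // Stand-in for Curator's ConnectionStateListener.
    interface Listener { void stateChanged(State newState); }

    // Hypothetical proxy: forwards the first "down", then stays quiet while open.
    static final class CircuitBreakingListener implements Listener {
        private final Listener delegate;
        private final ScheduledExecutorService scheduler;
        private final long openMillis;
        private boolean open = false;        // guarded by "this"
        private State lastSeen = null;       // latest state reported by the connection
        private State lastForwarded = null;  // latest state the delegate was told about

        CircuitBreakingListener(Listener delegate, ScheduledExecutorService scheduler, long openMillis) {
            this.delegate = delegate;
            this.scheduler = scheduler;
            this.openMillis = openMillis;
        }

        @Override
        public synchronized void stateChanged(State newState) {
            lastSeen = newState;
            if (open) {
                return; // circuit is open: remember the state but do not forward it
            }
            forward(newState);
            // Assumption: both SUSPENDED and LOST are treated as "connection lost".
            if (newState == State.SUSPENDED || newState == State.LOST) {
                open = true;
                scheduler.schedule(this::closeCircuit, openMillis, TimeUnit.MILLISECONDS);
            }
        }

        // Closes the circuit and reports the current state if the delegate has not seen it.
        synchronized void closeCircuit() {
            if (!open) {
                return;
            }
            open = false;
            if (lastSeen != lastForwarded) {
                forward(lastSeen);
            }
        }

        private void forward(State state) {
            lastForwarded = state;
            delegate.stateChanged(state);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<State> seen = new ArrayList<>();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CircuitBreakingListener breaker = new CircuitBreakingListener(seen::add, scheduler, 200);

        breaker.stateChanged(State.CONNECTED);
        // The connection flaps down/up/down/up within the open period.
        breaker.stateChanged(State.SUSPENDED);
        breaker.stateChanged(State.RECONNECTED);
        breaker.stateChanged(State.SUSPENDED);
        breaker.stateChanged(State.RECONNECTED);

        // By default, already-scheduled delayed tasks still run after shutdown().
        scheduler.shutdown();
        scheduler.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(seen);
    }
}
```

In this run the delegate is told about the first SUSPENDED and, once the circuit closes, about RECONNECTED; the flapping in between is swallowed. Invoking the delegate while holding the lock keeps the sketch short; a real implementation would probably hand notifications off to avoid blocking the connection state thread.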

> Race conditions in LeaderLatch after reconnecting to ensemble
> -------------------------------------------------------------
>
>                 Key: CURATOR-504
>                 URL: https://issues.apache.org/jira/browse/CURATOR-504
>             Project: Apache Curator
>          Issue Type: Bug
>    Affects Versions: 4.1.0
>            Reporter: Yuri Tceretian
>            Assignee: Jordan Zimmerman
>            Priority: Minor
>         Attachments: 51868597-65791000-231c-11e9-9bfa-1def62bc3ea1.png, 
> Screen Shot 2019-01-31 at 10.26.59 PM.png, 
> XP91JuD048Nl_8h9NZpH01QZJMfCLewjfd2eQNfOsR6GuApPNV.png
>
>
> We use LeaderLatch in a lot of places in our system, and when the ZooKeeper 
> ensemble is unstable and clients are reconnecting, the logs are full of 
> messages like the following:
> {{[2017-08-31 19:18:34,562][ERROR][org.apache.curator.framework.recipes.leader.LeaderLatch] Can't find our node. Resetting. Index: -1 {}}}
> According to the 
> [implementation|https://github.com/apache/curator/blob/4251fe328908e5fca37af034fabc190aa452c73f/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L529-L536],
>  this can happen in two cases:
>  * When the internal state `ourPath` is null.
>  * When the list of latches does not contain the expected one.
> I believe we hit the first condition because of races that occur after the 
> client reconnects to ZooKeeper.
>  * The client reconnects to ZooKeeper; LeaderLatch gets the event and calls 
> the reset method, which sets the internal state (`ourPath`) to null, removes 
> the old latch and creates a new one. This happens in the thread 
> "Curator-ConnectionStateManager-0".
>  * Almost simultaneously, LeaderLatch gets another event, NodeDeleted 
> ([here|https://github.com/apache/curator/blob/4251fe328908e5fca37af034fabc190aa452c73f/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L543-L554])
>  and tries to re-read the list of latches and check leadership. This happens 
> in the thread "main-EventThread".
> Therefore, the method `checkLeadership` is sometimes called while `ourPath` 
> is null.
> Below is an approximate diagram of what happens:
> !51868597-65791000-231c-11e9-9bfa-1def62bc3ea1.png|width=1261,height=150!
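
The interleaving described in the issue can be replayed with a small stand-alone model. This is a simplified sketch and not Curator's code: only the names {{ourPath}}, reset and {{checkLeadership}} come from the report, while {{beginReset}}, {{finishReset}}, {{nodeName}} and the node paths are hypothetical. The two threads are replayed sequentially on one thread so that the outcome is deterministic.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class LeaderLatchRaceSketch {
    // Simplified model of the latch's internal state.
    static final AtomicReference<String> ourPath =
        new AtomicReference<>("/latch/node-0000000001");

    // First half of a reset: the old node is forgotten immediately.
    static void beginReset() {
        ourPath.set(null);
    }

    // Second half of a reset: the new node is known only once its creation completes.
    static void finishReset(String newPath) {
        ourPath.set(newPath);
    }

    // Returns our position among the children, or -1 if our node cannot be found.
    static int checkLeadership(List<String> children) {
        String localOurPath = ourPath.get();
        return (localOurPath != null) ? children.indexOf(nodeName(localOurPath)) : -1;
    }

    static String nodeName(String path) {
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        List<String> children = List.of("node-0000000002");

        // "Curator-ConnectionStateManager-0": reconnect handling starts a reset.
        beginReset();

        // "main-EventThread": NodeDeleted arrives before the reset has finished.
        System.out.println("Index: " + checkLeadership(children)); // prints "Index: -1"

        // "Curator-ConnectionStateManager-0": the new node is finally recorded.
        finishReset("/latch/node-0000000002");
        System.out.println("Index: " + checkLeadership(children)); // prints "Index: 0"
    }
}
```

The window between the two halves of the reset is where the "Can't find our node" message would be logged in this model, even though the new node is about to exist.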



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
