[jira] [Commented] (CURATOR-504) Race conditions in LeaderLatch after reconnecting to ensemble

Jordan Zimmerman (JIRA) Thu, 31 Jan 2019 19:22:57 -0800


    [ 
https://issues.apache.org/jira/browse/CURATOR-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757924#comment-16757924
 ]


Jordan Zimmerman commented on CURATOR-504:
------------------------------------------

{quote}I believe we hit the first condition because of races that occur after 
the client reconnects to ZooKeeper.
The client reconnects to ZooKeeper, and LeaderLatch gets the event and calls 
the reset method, which sets the internal state (`ourPath`) to null, removes 
the old latch and creates a new one. This happens in thread 
"Curator-ConnectionStateManager-0".
Almost simultaneously, LeaderLatch gets another event, NodeDeleted, and tries 
to re-read the list of latches and check leadership. This happens in the 
thread "main-EventThread".
{quote}

This isn't really a race. There's not much that can be done about it. The 
LeaderLatch doesn't know what its node is until the background creation occurs. 
The only reason that a NodeDeleted event would be received is if the previous 
sequence node gets deleted. So, that's a rare case but could happen. The end 
result, though, is merely that the Latch resets. This is noisy but not race-y 
as the end result is fine. A test case should be written that simulates this 
just in case, though. There might be an optimization to detect that the latch 
is in this state - i.e. it's waiting for the node creation result - but I think 
that would be difficult and error prone.
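
As a starting point for such a test, here is a minimal sketch that replays the 
sequence from the report against a stand-in model rather than against Curator 
itself. Everything in it is an assumption made for illustration: the class 
LatchResetSimulation, its methods, and the node names are hypothetical, and the 
model only mimics the documented behaviour (reset clears `ourPath`, a check 
with no known node resets again). A CountDownLatch forces the NodeDeleted check 
to run after the reconnect reset and before the node creation result arrives.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for LeaderLatch; NOT Curator's implementation.
final class LatchResetSimulation {
    final AtomicReference<String> ourPath = new AtomicReference<>();
    final AtomicInteger resets = new AtomicInteger();
    volatile boolean hasLeadership;

    // Models reset(): give up leadership and forget our node until the
    // (here simulated) background creation reports a new path.
    void reset() {
        hasLeadership = false;
        ourPath.set(null);
        resets.incrementAndGet();
    }

    // Models the background creation callback delivering the new node path.
    void nodeCreated(String path) {
        ourPath.set(path);
    }

    // Models the leadership check: index -1 (unknown or missing node) resets.
    void checkLeadership(List<String> sortedChildren) {
        String local = ourPath.get();
        int index = (local == null)
                ? -1
                : sortedChildren.indexOf(local.substring(local.lastIndexOf('/') + 1));
        if (index < 0) {
            reset();
        } else {
            hasLeadership = (index == 0);
        }
    }

    static LatchResetSimulation run() {
        LatchResetSimulation latch = new LatchResetSimulation();
        CountDownLatch afterReset = new CountDownLatch(1);

        // Reconnect handling: reset, then let the event thread proceed.
        Thread connection = new Thread(() -> {
            latch.reset();
            afterReset.countDown();
        }, "Curator-ConnectionStateManager-0");

        // NodeDeleted handling: runs while ourPath is still null.
        Thread events = new Thread(() -> {
            try {
                afterReset.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            latch.checkLeadership(Arrays.asList("latch-0000000001"));
        }, "main-EventThread");

        connection.start();
        events.start();
        try {
            connection.join();
            events.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }

        // The creation result finally arrives and the check succeeds.
        latch.nodeCreated("/leader/latch-0000000002");
        latch.checkLeadership(Arrays.asList("latch-0000000002"));
        return latch;
    }

    public static void main(String[] args) {
        LatchResetSimulation latch = run();
        System.out.println("resets=" + latch.resets.get()
                + " leader=" + latch.hasLeadership);
    }
}
```

In this model the early check costs one extra reset (two in total), and the 
latch still ends up holding leadership once the creation result is delivered. 
A real test would need to drive an actual LeaderLatch against a test server.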



> Race conditions in LeaderLatch after reconnecting to ensemble
> -------------------------------------------------------------
>
>                 Key: CURATOR-504
>                 URL: https://issues.apache.org/jira/browse/CURATOR-504
>             Project: Apache Curator
>          Issue Type: Bug
>    Affects Versions: 4.1.0
>            Reporter: Yuri Tceretian
>            Assignee: Jordan Zimmerman
>            Priority: Minor
>         Attachments: 51868597-65791000-231c-11e9-9bfa-1def62bc3ea1.png
>
>
> We use LeaderLatch in a lot of places in our system, and when the ZooKeeper 
> ensemble is unstable and clients are reconnecting, the logs are full of 
> messages like the following:
> {{[2017-08-31 
> 19:18:34,562][ERROR][org.apache.curator.framework.recipes.leader.LeaderLatch] 
> Can't find our node. Resetting. Index: -1 {}}}
> According to the 
> [implementation|https://github.com/apache/curator/blob/4251fe328908e5fca37af034fabc190aa452c73f/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L529-L536],
>  this can happen in two cases:
>  * When internal state `ourPath` is null
>  * When the list of latches does not have the expected one.
> I believe we hit the first condition because of races that occur after the 
> client reconnects to ZooKeeper.
>  * The client reconnects to ZooKeeper, and LeaderLatch gets the event and 
> calls the reset method, which sets the internal state (`ourPath`) to null, 
> removes the old latch and creates a new one. This happens in thread 
> "Curator-ConnectionStateManager-0".
>  * Almost simultaneously, LeaderLatch gets another event, NodeDeleted 
> ([here|https://github.com/apache/curator/blob/4251fe328908e5fca37af034fabc190aa452c73f/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L543-L554]),
>  and tries to re-read the list of latches and check leadership. This happens 
> in the thread "main-EventThread".
> Therefore, `checkLeadership` is sometimes called while `ourPath` is null.
> Below is an approximate diagram of what happens:
> !51868597-65791000-231c-11e9-9bfa-1def62bc3ea1.png|width=1261,height=150!
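
The two cases named in the description both end in the same "Index: -1" 
message. The sketch below illustrates why, using a hypothetical helper 
(OurNodeIndex.indexOf) and made-up node names; it is not the linked Curator 
code, and it sorts child names lexicographically as a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of Curator.
final class OurNodeIndex {
    // Returns the position of our node among the sorted children, or -1
    // when ourPath is null or our node is absent from the list.
    static int indexOf(String ourPath, List<String> children) {
        if (ourPath == null) {
            return -1; // case 1: internal state not set yet
        }
        List<String> sorted = new ArrayList<>(children);
        Collections.sort(sorted); // simplification: plain lexicographic order
        String node = ourPath.substring(ourPath.lastIndexOf('/') + 1);
        return sorted.indexOf(node); // case 2: -1 if the node is not listed
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("latch-0000000002", "latch-0000000001");
        System.out.println(indexOf(null, children));                       // -1
        System.out.println(indexOf("/leader/latch-0000000003", children)); // -1
        System.out.println(indexOf("/leader/latch-0000000001", children)); // 0
    }
}
```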



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
