[jira] [Commented] (CURATOR-498) LeaderLatch deletes leader and leaves it hung besides a second leader

ASF GitHub Bot (JIRA) Tue, 01 Jan 2019 19:39:25 -0800


    [ https://issues.apache.org/jira/browse/CURATOR-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731751#comment-16731751 ]


ASF GitHub Bot commented on CURATOR-498:
----------------------------------------

GitHub user Randgalt opened a pull request:

    https://github.com/apache/curator/pull/299

    [CURATOR-498] - Fix protection mode race with ephemeral nodes

    "Protection" has a potential bug. If the connection is lost for long
    enough, Curator will want to kill the session. Session deletions must be
    handled by the leader ZooKeeper instance. While the session kill is being
    processed, Curator's protection mode handling could be asking the follower
    it is connected to for the current list of children, a request the
    follower can serve directly without calling the leader. In this scenario,
    the client gets a list of children that still includes the ZNode that
    will be deleted as part of killing the session.

    This bug has been in Curator since we added the protection feature to it
    more than 6 years ago. The fix is to include the session ID in the
    protection ID that is generated for the node name when the create mode is
    an ephemeral type. Then, if findProtectedNodeInForeground() finds the node
    in the scenario described above, it can compare the node's session ID to
    the current ZooKeeper handle's session ID and disregard the found node if
    they don't match.
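
The naming idea behind the fix can be sketched roughly as follows. This is a minimal illustration, not the code in the pull request: the class and method names are hypothetical, and the `_c_` prefix and the `_` and `-` separators are assumptions made for the sketch. It only shows how embedding the session ID in an ephemeral node's protection ID lets the lookup skip a node left over from a different session.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.UUID;

// Hypothetical sketch; names and name layout are assumptions, not Curator's API.
final class ProtectedIdSketch {
    // Assumed marker that starts every protected node name.
    static final String PREFIX = "_c_";

    // For ephemeral modes, append the session ID (hex) to the protection ID.
    static String idFor(String protectedId, long sessionId, boolean ephemeral) {
        return ephemeral ? protectedId + "_" + Long.toHexString(sessionId) : protectedId;
    }

    // Build the node name that would be passed to create().
    static String protectedName(String protectedId, long sessionId,
                                boolean ephemeral, String name) {
        return PREFIX + idFor(protectedId, sessionId, ephemeral) + "-" + name;
    }

    // Return the child carrying our protection ID. For ephemeral nodes the
    // child must also carry the current session ID, so a node that belongs
    // to a session being killed is disregarded.
    static Optional<String> findProtectedNode(List<String> children, String protectedId,
                                              long currentSessionId, boolean ephemeral) {
        String wanted = PREFIX + idFor(protectedId, currentSessionId, ephemeral) + "-";
        return children.stream().filter(child -> child.startsWith(wanted)).findFirst();
    }

    public static void main(String[] args) {
        String id = UUID.randomUUID().toString();
        // Node created in an old session that is in the middle of being killed.
        String stale = protectedName(id, 0x1aL, true, "latch-0000000480");
        // A follower may still list the stale node; the lookup runs under a new session.
        Optional<String> found = findProtectedNode(Arrays.asList(stale), id, 0x2bL, true);
        System.out.println("stale node matched: " + found.isPresent());
    }
}
```

Matching on the full prefix keeps the check to a single string comparison; parsing the session ID back out of the name and comparing it as a number would work equally well.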


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/curator CURATOR-498

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/curator/pull/299.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #299
    
----
commit 4b0bc85d066f8582b55d76092c391bad04bd48a5
Author: randgalt <randgalt@...>
Date:   2018-12-31T11:24:02Z

    CURATOR-498 - include session ID in log message for injecting session
    expiration

commit dafd091412a834a128c9882d2b9534d1a0ff7735
Author: randgalt <randgalt@...>
Date:   2019-01-02T03:34:41Z

    CURATOR-498

    "Protection" has a potential bug. If the connection is lost for long
    enough, Curator will want to kill the session. Session deletions must be
    handled by the leader ZooKeeper instance. While the session kill is being
    processed, Curator's protection mode handling could be asking the follower
    it is connected to for the current list of children, a request the
    follower can serve directly without calling the leader. In this scenario,
    the client gets a list of children that still includes the ZNode that
    will be deleted as part of killing the session.

    This bug has been in Curator since we added the protection feature to it
    more than 6 years ago. The fix is to include the session ID in the
    protection ID that is generated for the node name when the create mode is
    an ephemeral type. Then, if findProtectedNodeInForeground() finds the node
    in the scenario described above, it can compare the node's session ID to
    the current ZooKeeper handle's session ID and disregard the found node if
    they don't match.

----


> LeaderLatch deletes leader and leaves it hung besides a second leader
> ---------------------------------------------------------------------
>
>                 Key: CURATOR-498
>                 URL: https://issues.apache.org/jira/browse/CURATOR-498
>             Project: Apache Curator
>          Issue Type: Bug
>    Affects Versions: 4.0.1, 4.1.0
>         Environment: ZooKeeper 3.4.13, Curator 4.1.0 (explicitly selecting 
> ZooKeeper 3.4.13), Linux
>            Reporter: Shay Shimony
>            Assignee: Jordan Zimmerman
>            Priority: Blocker
>         Attachments: CURATOR-498.png, HaWatcher.log, LeaderLatch0.java, 
> ha.tar.gz, logs.tar.gz
>
>
> The Curator app I am working on uses LeaderLatch to select a leader among 
> 6 clients.
> While testing my app, I noticed that when I make ZK lose its quorum for a 
> while and then restore it, after Curator in my app restores its 
> connection to ZK, sometimes not all 6 clients are found in the latch 
> path (checked using zkCli.sh). That is, I see 5 instead of 6.
> After investigating a little, I suspect that LeaderLatch deleted the 
> leader in the setNode method.
> To investigate, I copied the LeaderLatch code and added some log messages. 
> From them it seems that a very old create() background callback was 
> unexpectedly scheduled and corrupted the current leader's state with its 
> stale path name. That is, the old callback called setNode with its stale 
> name, set itself in place of the leader, and deleted the leader. This leaves 
> the client running, thinking it is the leader, while another leader is 
> selected.
> If my analysis is correct, it seems we need to cancel this obsolete 
> create callback (I think its session was suspended at 22:38:54 and 
> then lost at 22:39:04, so on SUSPENDED, cancel ongoing callbacks).
> Please see the attached log file and the modified LeaderLatch0.
>  
> In the log, note that at 22:39:26 it shows that 0000000485 is replaced by 
> 0000000480 and then probably deleted.
> Note also that at 22:38:52, 34 seconds earlier, we can see that it was in the 
> reset() method ("RESET OUR PATH") and possibly triggered the creation of 
> 0000000480 then.
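
One way to picture the reporter's suggestion of cancelling obsolete callbacks is an epoch guard. This is a rough sketch under stated assumptions, not LeaderLatch code and not the fix in the pull request: the class and all method names are hypothetical. Each background create() captures a token, the token is invalidated on SUSPENDED or reset, and a callback that arrives with an old token is ignored instead of replacing the current path.

```java
// Hypothetical sketch; not part of Curator's LeaderLatch.
final class StaleCallbackGuardSketch {
    private long epoch = 0;   // bumped whenever in-flight callbacks become obsolete
    private String ourPath;   // path of the node this client currently owns

    // Call when issuing a background create(); keep the token with the callback.
    synchronized long beginCreate() {
        return epoch;
    }

    // Call on SUSPENDED/LOST or from reset(): callbacks issued earlier are now stale.
    synchronized void invalidate() {
        epoch++;
    }

    // Call from the create() callback. Returns false, and changes nothing,
    // when the callback was issued before the last invalidation.
    synchronized boolean onCreated(long token, String createdPath) {
        if (token != epoch) {
            return false; // stale callback: must not overwrite the current path
        }
        ourPath = createdPath;
        return true;
    }

    synchronized String ourPath() {
        return ourPath;
    }
}
```

A caller that gets `false` back would still need to delete the node the stale callback created, otherwise that node lingers until its session ends.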



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
