[jira] [Updated] (CURATOR-498) LeaderLatch deletes leader and leaves it hung beside a second leader

Shay Shimony (JIRA) Tue, 25 Dec 2018 14:34:12 -0800


     [ 
https://issues.apache.org/jira/browse/CURATOR-498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Shay Shimony updated CURATOR-498:
---------------------------------
    Description: 
The Curator app I am working on uses the LeaderLatch to select a leader out of 
6 clients.

While testing my app, I noticed that when I make ZK lose its quorum for a while 
and then restore it, then after my app restores its connection to ZK, 
sometimes not all 6 clients are found in the latch path (checked with 
zkCli.sh). That is, I see 5 instead of 6.

After investigating a little, I suspect that LeaderLatch deleted the leader in 
its setNode method.

To investigate, I copied the LeaderLatch code and added some log messages. 
From them it seems that a very old create() background callback was 
unexpectedly scheduled and overwrote the current leader's path with its own 
stale path name. That is, the old callback called setNode with its stale name, 
set itself in place of the leader, and deleted the leader. This leaves a client 
running and thinking it is the leader while another leader is selected.

If my analysis is correct, it seems we need to cancel this obsolete create 
callback.

Please see the attached log file and the modified LeaderLatch0.

 

In the log, note that 0000000485 is replaced by 0000000480 and then probably 
deleted.
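
The suspected sequence can be reproduced in a small stand-alone Java sketch. 
This is not Curator code: the Latch class, its epoch counter, beginCreate() 
and the "latch-" node-name prefix are all hypothetical, and only the setNode 
behaviour (install the new path, delete the one it replaces) and the two 
sequence numbers are taken from the report. The guarded variant shows one 
possible way to ignore an obsolete create callback; it is an assumption about 
what a fix could look like, not the actual fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class StaleCallbackSketch {
    /** Hypothetical stand-in for one latch participant; NOT Curator's LeaderLatch. */
    static final class Latch {
        private final boolean guarded;
        private long epoch = 0;          // bumped every time a create is started
        private String ourPath = null;   // the node this client believes is its own
        private final List<String> deleted = new ArrayList<>();

        Latch(boolean guarded) { this.guarded = guarded; }

        // As described in the report: install the new path and delete the old one.
        void setNode(String newPath) {
            String oldPath = ourPath;
            ourPath = newPath;
            if (oldPath != null) {
                deleted.add(oldPath);
            }
        }

        // Starts a create and returns the callback that runs when it completes.
        Consumer<String> beginCreate() {
            final long myEpoch = ++epoch;
            return createdPath -> {
                if (guarded && myEpoch != epoch) {
                    // Assumed guard: a newer create superseded this one, so
                    // remove the stale node instead of touching ourPath.
                    deleted.add(createdPath);
                    return;
                }
                setNode(createdPath);
            };
        }
    }

    static String simulate(boolean guarded) {
        Latch latch = new Latch(guarded);
        Consumer<String> oldCallback = latch.beginCreate(); // issued before quorum loss
        Consumer<String> newCallback = latch.beginCreate(); // issued after reconnect
        newCallback.accept("latch-0000000485");             // current node is set first
        oldCallback.accept("latch-0000000480");             // stale callback fires late
        return "current=" + latch.ourPath + " deleted=" + latch.deleted;
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + simulate(false));
        System.out.println("guarded:   " + simulate(true));
    }
}
```

Without the guard, the late callback replaces 0000000485 with 0000000480 and 
the live node is deleted, which matches what the log appears to show. The 
sketch is single-threaded; real code would also need proper synchronization.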

  was:
The Curator app I am working on uses the LeaderLatch to select a leader out of 
6 clients.

While testing my app, I noticed that when I make ZK lose its quorum for a while 
and then restore it, then after my app restores its connection to ZK - 
sometimes not all the 6 clients are found in the latch path (using zkCli.sh). 
That is, I have 5 instead of 6.

After investigating a little, I have a suspicion that LeaderLatch deleted the 
leader in method setNode.

To investigate it I copied the LeaderLatch code and added some log messages, 
and from them it seems like very old create() background callback was 
surprisingly scheduled and corrupted the current leader with its stale path 
name. Meaning, this old one called setNode with its stale name, while sets 
itself instead of the leader and deletes the leader. This leaves client 
running, thinking it is the leader, while another leader is selected.

If my analysis is correct then it seems like we need to make this obsolete 
create callback cancelled.

Please see attached log file and modified LeaderLatch0.

 

In the log, note that 0000000485 is replaced by 0000000480 and then probably 
deleted.


> LeaderLatch deletes leader and leaves it hung beside a second leader
> --------------------------------------------------------------------
>
>                 Key: CURATOR-498
>                 URL: https://issues.apache.org/jira/browse/CURATOR-498
>             Project: Apache Curator
>          Issue Type: Bug
>    Affects Versions: 4.0.1, 4.1.0
>         Environment: ZooKeeper 3.4.13, Curator 4.1.0 (selecting explicitly 
> 3.4.13), Linux
>            Reporter: Shay Shimony
>            Assignee: Jordan Zimmerman
>            Priority: Major
>         Attachments: HaWatcher.log, LeaderLatch0.java
>
>
> The Curator app I am working on uses the LeaderLatch to select a leader out 
> of 6 clients.
> While testing my app, I noticed that when I make ZK lose its quorum for a 
> while and then restore it, then after my app restores its connection to ZK, 
> sometimes not all 6 clients are found in the latch path (checked with 
> zkCli.sh). That is, I see 5 instead of 6.
> After investigating a little, I suspect that LeaderLatch deleted the leader 
> in its setNode method.
> To investigate, I copied the LeaderLatch code and added some log messages. 
> From them it seems that a very old create() background callback was 
> unexpectedly scheduled and overwrote the current leader's path with its own 
> stale path name. That is, the old callback called setNode with its stale 
> name, set itself in place of the leader, and deleted the leader. This leaves 
> a client running and thinking it is the leader while another leader is 
> selected.
> If my analysis is correct, it seems we need to cancel this obsolete create 
> callback.
> Please see the attached log file and the modified LeaderLatch0.
>  
> In the log, note that 0000000485 is replaced by 0000000480 and then probably 
> deleted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
