[ 
https://issues.apache.org/jira/browse/CURATOR-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16730940#comment-16730940
 ] 

Shay Shimony edited comment on CURATOR-498 at 12/30/18 12:09 PM:
-----------------------------------------------------------------

The ZK transactions log showed me the association between a ZNode and its 
session. For example:

 

28/12/2018, 16:33:02 IST session *0x1000ae5465c0007* cxid 0x1b zxid 
0x180000006f *create* 
'/failsafe/default/sites/prod/leader-latch/_c_0e156808-69bf-4932-b0fd-12d0fb407365-latch-*0000000083*,#7036,v{s{31,s

{'world,'anyone}

}},T,84

 

So 0000000083 was created by session 0x1000ae5465c0007 at 16:33:02, as reported 
by the ZK servers.

And even though session 0x1000ae5465c0007 was decided as dead by Curator on 
16:33:14 (and started new session 0x1000c6bc82c0001 on 16:33:17), the callback 
of create 0000000083 seems to be executed on 16:33:17. 

At this point in time, session 0x1000ae5465c0007 was supposed to be considered 
as dead and unused by Curator, but still, somehow, the callback, which was 
created by that session (probably on 16:33:02 or before), made it running (by 
the closed EventThread of ClientCnxn?? I mean, we already saw " 2018-12-28 
16:33:14 INFO  ClientCnxn:522 - EventThread shut down for session: 
0x1000ae5465c0007").

And that stale callback corrupted the state of LeaderLatch.


was (Author: shayshim):
The ZK transactions log showed me the association between a ZNode and its 
session. For example:

 

28/12/2018, 16:33:02 IST session *0x1000ae5465c0007* cxid 0x1b zxid 
0x180000006f *create* 
'/failsafe/default/sites/prod/leader-latch/_c_0e156808-69bf-4932-b0fd-12d0fb407365-latch-*0000000083*,#7036,v{s{31,s

{'world,'anyone}}},T,84

 

So 0000000083 was created by session 0x1000ae5465c0007 16:33:02, as reported by 
the ZK servers.

And even though session 0x1000ae5465c0007 was decided as dead by Curator on 
16:33:14 (and started new session 0x1000c6bc82c0001 on 16:33:17), the callback 
of create 0000000083 seems to be executed on 16:33:17. 

At this point in time, session 0x1000ae5465c0007 was supposed to be considered 
as dead and unused by Curator, but still, somehow, the callback, which was 
created by that session (probably on 16:33:02 or before), made it running (by 
the closed EventThread of ClientCnxn?? I mean, we already saw " 2018-12-28 
16:33:14 INFO  ClientCnxn:522 - EventThread shut down for session: 
0x1000ae5465c0007").

And that stale callback corrupted the state of LeaderLatch.

> LeaderLatch deletes leader and leaves it hung besides a second leader
> ---------------------------------------------------------------------
>
>                 Key: CURATOR-498
>                 URL: https://issues.apache.org/jira/browse/CURATOR-498
>             Project: Apache Curator
>          Issue Type: Bug
>    Affects Versions: 4.0.1, 4.1.0
>         Environment: ZooKeeper 3.4.13, Curator 4.1.0 (selecting explicitly 
> 3.4.13), Linux
>            Reporter: Shay Shimony
>            Assignee: Jordan Zimmerman
>            Priority: Major
>         Attachments: HaWatcher.log, LeaderLatch0.java, ha.tar.gz, logs.tar.gz
>
>
> The Curator app I am working on uses the LeaderLatch to select a leader out 
> of 6 clients.
> While testing my app, I noticed that when I make ZK lose its quorum for a 
> while and then restore it, then after Curator in my app restores it's 
> connection to ZK - sometimes not all the 6 clients are found in the latch 
> path (using zkCli.sh). That is, I have 5 instead of 6.
> After investigating a little, I have a suspicion that LeaderLatch deleted the 
> leader in method setNode.
> To investigate it I copied the LeaderLatch code and added some log messages, 
> and from them it seems like very old create() background callback was 
> surprisingly scheduled and corrupted the current leader with its stale path 
> name. Meaning, this old one called setNode with its stale name, and set 
> itself instead of the leader and deleted the leader. This leaves client 
> running, thinking it is the leader, while another leader is selected.
> If my analysis is correct then it seems like we need to make this obsolete 
> create callback cancelled (I think its session was suspended on 22:38:54 and 
> then lost on 22:39:04 - so on SUSPENDED cancel ongoing callbacks).
> Please see attached log file and modified LeaderLatch0.
>  
> In the log, note that on 22:39:26 it shows that 0000000485 is replaced by 
> 0000000480 and then probably deleted.
> Note also that at 22:38:52, 34 seconds before, we can see that it was in the 
> reset() method ("RESET OUR PATH") and possibly triggered the creation of 
> 0000000480 then.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to