[ 
https://issues.apache.org/jira/browse/CURATOR-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kezhu Wang reassigned CURATOR-604:
----------------------------------

    Assignee: Kezhu Wang

> Double locking issue while using InterProcessMutex lock code
> ------------------------------------------------------------
>
>                 Key: CURATOR-604
>                 URL: https://issues.apache.org/jira/browse/CURATOR-604
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.5.0
>            Reporter: Viswanathan Rajagopal
>            Assignee: Kezhu Wang
>            Priority: Minor
>
> We recently came across a “double locking issue” (i.e. two clients 
> acquiring the same lock) using Curator code (the InterProcessMutex lock 
> APIs) in our application.
> *+Our use case:+*
>  * Two clients attempt to acquire a ZooKeeper lock using Curator's 
> InterProcessMutex, and whichever client owns the lock releases it as soon 
> as it sees the connection drop (on receiving the ConnectionState.SUSPENDED 
> / ConnectionState.LOST events from a Curator ConnectionStateListener); a 
> sketch of this pattern follows below
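> Assuming a single lock path, the pattern looks roughly like this (the class 
> name, lock path, and connection string are ours, purely illustrative):
> {code:java}
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.framework.recipes.locks.InterProcessMutex;
> import org.apache.curator.framework.state.ConnectionState;
> import org.apache.curator.framework.state.ConnectionStateListener;
> import org.apache.curator.retry.ExponentialBackoffRetry;
> 
> public class LockHolder {
>     public static void main(String[] args) throws Exception {
>         CuratorFramework client = CuratorFrameworkFactory.newClient(
>                 "localhost:2181", new ExponentialBackoffRetry(1000, 3));
>         client.start();
> 
>         final InterProcessMutex mutex =
>                 new InterProcessMutex(client, "/locks/demo");
> 
>         // Release the lock as soon as the connection is suspended or lost
>         client.getConnectionStateListenable().addListener(
>                 new ConnectionStateListener() {
>             @Override
>             public void stateChanged(CuratorFramework c,
>                                      ConnectionState newState) {
>                 if ((newState == ConnectionState.SUSPENDED
>                         || newState == ConnectionState.LOST)
>                         && mutex.isAcquiredInThisProcess()) {
>                     try {
>                         mutex.release();
>                     } catch (Exception e) {
>                         // release is retried, as described in the
>                         // timeline below
>                     }
>                 }
>             }
>         });
> 
>         mutex.acquire(); // blocks until this client owns the lock
>         // ... critical section ...
>     }
> }
> {code}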
> *+Issue we noticed:+*
>  * After the session expired and the clients reconnected with new sessions, 
> both clients appeared to have acquired the lock. The interesting thing we 
> found is that one of the clients still held the lock even though its 
> (ephemeral) lock node was gone
> *+Things we found:+*
>  * Based on our initial analysis and a few test runs, we saw that *_Curator's 
> acquire() method grants the lock based on an “about to be deleted” lock node 
> from the previous session._* *Explanation:* an ephemeral node created by the 
> previous session is still visible to the client that reconnected with a new 
> session id, until the server cleans it up. When this happens, Curator's 
> acquire() holds the lock.
>  
>  * Clearly we could see a *_race condition (in the ZooKeeper code) between 
> 1) the client reconnecting to the server with a new session id and 2) the 
> server deleting the ephemeral nodes of the client’s previous session_*.
>  
>  * Given this race condition, if the client manages to reconnect to the 
> server with a new session id before the server cleans up the ephemeral 
> nodes of the client’s previous session, a Curator acquire() that is trying 
> to acquire the lock will grant it, because it still sees the old lock node 
> in the ZooKeeper directory. Eventually the server deletes the ephemeral 
> node, leaving the Curator local lock thread data stale and giving the 
> illusion that the client still holds the lock while its ephemeral node is 
> gone; one way to observe this window is sketched below
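> A hypothetical diagnostic helper for spotting that window, using the raw 
> ZooKeeper API rather than Curator (the class and method names are ours):
> {code:java}
> import org.apache.zookeeper.ZooKeeper;
> import org.apache.zookeeper.data.Stat;
> 
> public class StaleNodeCheck {
>     // Returns true if lockNodePath is an ephemeral node owned by a session
>     // other than ours, i.e. a node the server is "about to delete" while it
>     // cleans up an expired session.
>     public static boolean isStaleLockNode(ZooKeeper zk, String lockNodePath)
>             throws Exception {
>         Stat stat = zk.exists(lockNodePath, false);
>         if (stat == null) {
>             return false; // already cleaned up by the server
>         }
>         return stat.getEphemeralOwner() != 0
>                 && stat.getEphemeralOwner() != zk.getSessionId();
>     }
> }
> {code}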
>  
>  * To make this easier to follow, we have categorized the timeline of 
> events below (a sketch of the ownership check appears after the timeline):
> [ CONNECTED ] :
>  ** Client A and Client B call acquire(), creating nodes N1 and N2 
> respectively
>  ** Client A's acquire() -> holds the lock, as its node N1 is the first 
> node in the sequence
>  ** Client B's acquire() -> created node N2 and watches the previous 
> sequence node (blocked on wait())
>  
> [ SUSPENDED CONNECTION OCCURS ] :
>  ** Network partition happens
>  ** Curator fires SUSPENDED
>  ** Client A, on receiving the SUSPENDED event, attempts to release the lock
>  ** Client B, on receiving the SUSPENDED event, does nothing, as it does 
> not hold the lock (it is blocked in acquire() -> wait())
>  
> [ SESSION EXPIRED ] :
>  ** Client A attempts to release the lock (with retries)
>  ** Client B does nothing, as it does not hold the lock (and is still 
> blocked in acquire() -> wait())
>  ** The server prepares to clean up the previous client sessions and delete 
> the lock nodes created under them
>  
> [ RECONNECTED ] :
>  ** Client A and Client B reconnect to another server with new session ids 
> (note: this happens before the server manages to clean up / delete the 
> ephemeral nodes of the previous client sessions)
>  ** Client A releases the lock successfully (i.e. deletes its lock node N1) 
> and attempts to acquire the lock again by creating lock node N3 and 
> watching the previous sequence node (N2)
>  ** Client B, which was blocked on acquire() -> wait(), is notified of the 
> previous sequence node's (N1) deletion -> it calls getChildren, sorts the 
> children, and checks whether its own node is the first node in the 
> sequence. Client B still sees its lock node N2 (what I call the node “about 
> to be deleted” by the server) and thereby acquires the lock
>  
> [ AFTER A FEW SECONDS ] :
>  ** The server manages to delete the ephemeral node N2 as part of the 
> previous client session's cleanup
>  ** Client A, which was blocked on acquire() -> wait(), is notified of the 
> previous sequence node's (N2) deletion -> it calls getChildren, sorts the 
> children, sees that its node is now the first node in the sequence, and 
> thereby acquires the lock
>  ** Client B's local lock thread data has gone stale (its lock node N2 has 
> now been deleted by the server), so both clients believe they hold the lock
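> The ownership check both clients run in the timeline above is roughly the 
> following (a simplified approximation of Curator's internal check, not a 
> verbatim copy; the class and method names are ours):
> {code:java}
> import java.util.Collections;
> import java.util.List;
> import org.apache.zookeeper.ZooKeeper;
> 
> public class OwnershipCheck {
>     public static boolean getsTheLock(ZooKeeper zk, String basePath,
>                                       String ourNodeName) throws Exception {
>         List<String> children = zk.getChildren(basePath, false);
>         Collections.sort(children); // lock-NNNNNNNNNN sequence order
>         // Client B runs this right after N1 is deleted: N2 is still listed
>         // because the server has not finished cleaning up the old session,
>         // so N2 sorts first and B wrongly concludes it owns the lock.
>         return children.indexOf(ourNodeName) == 0;
>     }
> }
> {code}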
>  
>  * *Created a test case simulator* 
> [https://github.com/ViswaNXplore/curator/blob/curator-2.5.0/patch/double_locking_test_simulator/curator-recipes/src/test/java/org/apache/curator/framework/recipes/CuratorRecipeTest.java]
>  
> +Approach #1 (artificial way of reproducing the issue)+
>  ## Run the test case with SIMULATE_ISSUE_BY_EXPLICIT_DELETE set to true. 
> This deletes the lock node after a pause, simulating the state where the 
> ZooKeeper client has reconnected and still happens to see the ephemeral 
> node just before the server deletes it (its session having expired); the 
> node is then deleted by the server afterwards (see the sketch below).
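> In essence, the simulation boils down to something like the following (a 
> hypothetical condensation of the linked test; the class, method, and 
> parameter names are ours):
> {code:java}
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.recipes.locks.InterProcessMutex;
> 
> public class ExplicitDeleteSimulation {
>     public static void simulateLateServerCleanup(CuratorFramework client,
>                                                  InterProcessMutex mutex,
>                                                  String lockNodePath)
>             throws Exception {
>         mutex.acquire();                       // the client now owns the lock
>         Thread.sleep(5000L);                   // pause, as in the steps above
>         client.delete().forPath(lockNodePath); // stand-in for the server's
>                                                // late cleanup of the session
>         // mutex.isAcquiredInThisProcess() still returns true here even
>         // though the lock node is gone: the local lock state is stale.
>     }
> }
> {code}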
> +Approach #2 (a little manual intervention needed to reproduce the issue)+
>  ## Run the test case in debug mode (SIMULATE_ISSUE_BY_EXPLICIT_DELETE set 
> to false)
>  ## Artificially delay / pause the deletion of the ephemeral lock nodes 
> during the session cleanup in the server code (the ZooKeeperServer.close() 
> method)
>  ## After a pause (say 5s), to let one of the instances acquire the lock, 
> artificially break the socket connection between client and server for 30s 
> (by setting a breakpoint in the ClientCnxnSocketNIO.doIO() method). After 
> 30s, the server logs that it is closing the session
>  ## After 1 min, remove the breakpoint in ClientCnxnSocketNIO.doIO() and 
> resume both Thread 2 and Thread 3
>  ## After that, resume the server thread (the thread name is 
> “SessionTracker”)
> *+Note:+* 
>  * We are not certain whether this race condition we noticed in the 
> ZooKeeper code is intentional by design.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
