[ https://issues.apache.org/jira/browse/CURATOR-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kezhu Wang reassigned CURATOR-604: ---------------------------------- Assignee: Kezhu Wang > Double locking issue while using InterProcessMutex lock code > ------------------------------------------------------------ > > Key: CURATOR-604 > URL: https://issues.apache.org/jira/browse/CURATOR-604 > Project: Apache Curator > Issue Type: Bug > Components: Recipes > Affects Versions: 2.5.0 > Reporter: Viswanathan Rajagopal > Assignee: Kezhu Wang > Priority: Minor > > Recently came across “Double Locking Issue (i.e. two clients acquiring lock)” > using Curator code ( InterProcessMutex lock APIs ) in our application > *+Our use case:+* > * Two clients attempts to acquire the zookeeper lock using Curator > InterProcessMutex and whoever owns it would release it once sees the > connection disconnect ( on receiving Connection.SUSPENDED / Connection.LOST > Curator Connection Events from Connection listener) > *+Issue we noticed:+* > * After session expired & reconnected with new session, both client seems to > have acquired the lock. Interesting thing that we found is that one of the > clients still holds the lock while its lock node (ephemeral) was gone > *+Things we found:+* > * Based on our initial analysis and few test runs, we saw that *_Curator > acquire() method acquires the lock based on “about to be deleted lock node of > previous session”._* *Explanation :* Ephemeral node created by previous > session was still seen by client that reconnected with new session id until > server cleans that up. If this happens, Curator acquire() would hold the lock. > > * Clearly we could see the *_race condition (in zookeeper code) between 1). > Client reconnecting to server with new session id and 2). server deleting the > ephemeral nodes of client’s previous session_*. > > * On the above mentioned race condition, if client manage to reconnect to > server with new session id before server cleans up the ephemeral nodes of > client’s previous session, Curator lock acquire() who is trying to acquire > the lock will hold the lock as it still sees the lock node in zookeeper > directory. Eventually server would be cleaning up the ephemeral nodes leaving > the Curator local lock thread data stale giving the illusion that it still > hold the lock while its ephemeral node is gone > > * Trying to categorize the timeline events below to see if that would help > us for better understanding, > [ CONNECTED ] : > ** Client A & Client B calls acquire creating Node N1 & N2 respectively > ** Client A acquire() -> holds the lock as its Node N1 is first node in > sequence > ** Client B acquire() -> created Node N2, watches for previous sequence node > (blocked on wait()) > > [ SUSPENDED CONNECTION OCCURS ] : > ** Network partition happens > ** Curator fires SUSPENDED > ** Client A, on receiving SUSPENDED event, attempts to release the lock > ** Client B, on receiving SUSPENDED event, does nothing as it was not > holding any lock (as blocked on acquire() -> wait() ) > > [ SESSION EXPIRED ] : > ** Client A attempts to release the lock (with retries) > ** Client B does nothing as it was not holding any lock (and is still > blocked on acquire() -> wait() ) > ** Server prepared to cleanup previous client session and deleting lock > nodes created as part of previous client session > > [ RECONNECTED ] : > ** Client A and Client B reconnected to another server with new session id > (Note : This happens before server managed to cleanup / delete ephemeral > nodes of previous client session) > ** Client A released the lock successfully (means it would delete its lock > node N1) and attempts to acquire lock by creating lock node N3 and watches > for previous sequence node (N2) > ** Client B who was blocked on acquire() -> wait() would be notified with > previous sequence node (N1) deletion -> getChildren, sorting them, and seeing > if the lock's node is the first node in the sequence. So, Client B sees its > lock node N2 still (which I call it as about to be deleted node by server) > and thereby acquires the lock > > [ AFTER FEW SECONDS ] : > ** Server managed to delete the ephemeral node N2 as part of previous client > session cleanup > ** Client A who was blocked on acquire() -> wait() would be notified with > previous sequence node (N2) deletion -> getChildren, sorting them, and seeing > if the lock's node is the first node in the sequence and thereby acquires the > lock > ** Client B – its local lock thread data went stale (as its lock path N2 not > has been deleted by Server)** > > * *Created a test case simulator* > [https://github.com/ViswaNXplore/curator/blob/curator-2.5.0/patch/double_locking_test_simulator/curator-recipes/src/test/java/org/apache/curator/framework/recipes/CuratorRecipeTest.java] > > +Approach #1 (Artificial way of reproducing the issue)+ > ## Run the test case with (SIMULATE_ISSUE_BY_EXPLICIT_DELETE set to true). > This would delete the lock node after a pause just to simulate the state > where the zkClient had reconnected and it still happens to see the ephermal > node just before the server deletes it since its session has expired, but the > node is deleted afterwards by the server. > +Approach #2 (A little manual interruption needed to reproduce the issue)+ > ## Run the test case in Debug mode (SIMULATE_ISSUE_BY_EXPLICIT_DELETE set to > false) > ## Artificially delaying / pausing the ephemeral lock nodes deletion as part > of session cleanup process in server code (ZookeeperServer.close() method) > ## After a pause (say 5s) to make one of the instance to acquire the lock, > Artificially break the socket connection between client and server for 30s > (by keeping breakpoint in ClientCnxnSocketNIO.doIO() method). After 30s, we > would see session closing logs logged in server code > ## After 1min, remove breakpoint in ClientCnxnSocketNIO.doIO() and resume > both Thread 2 and Thread 3 > ## After that, resume server thread (thread name would be “SessionTracker” > *+Note:+* > * We are not certain that this race condition that we noticed in zookeeper > code is intentional design. -- This message was sent by Atlassian Jira (v8.20.10#820010)