Here's the PR for this:

        https://github.com/apache/curator/pull/297

I've run the tests a few times and it seems good. Another pair of eyes or two 
would be appreciated.

-Jordan

> On Dec 13, 2018, at 2:09 PM, Jordan Zimmerman <[email protected]> 
> wrote:
> 
> FYI
> 
> I was trying to figure out why TestInterProcessSemaphoreMutex is so flakey. 
> In debugging the problem I've come across a fairly serious edge case in how 
> several of our recipes work. You can see the issue in InterProcessSemaphoreV2 
> (which is what InterProcessSemaphoreMutex uses internally). Have a look here:
> 
>       
> https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/InterProcessSemaphoreV2.java#L373
> 
> The code synchronizes and then does "client.getChildren()...". This is where 
> the problem is. If there are connection problems inside of getChildren() the 
> retry policy will do configured sleeping, retries, etc. Importantly, this is 
> all done while the thread doing the retires holds InterProcessSemaphoreV2's 
> monitor. If the ZK connection is repaired past the session timeout, ZK will 
> eventually call InterProcessSemaphoreV2's watcher with an Expired message. 
> InterProcessSemaphoreV2's watcher calls this method:
> 
>       private synchronized void notifyFromWatcher()
>       {
>           notifyAll();
>       }
> 
> You can see that this is a race. The thread doing "getChildren" is holding 
> the monitor and is in a retry loop waiting for the connection to be repaired. 
> However, ZK's event loop is trying to obtain that same monitor as a result of 
> trying to call the synchronized notifyFromWatcher(). This means that the 
> retry policy will always fail because ZK's event loop is tied up until that 
> thread exists. Worse still, if someone were to use a retry policy of 
> "RetryForever" they'd have a deadlock. 
> 
> This pattern is in about 10 files or so. I'm trying to think of a workaround. 
> One possibility is to use a separate thread for this type of notification. 
> i.e. notifyFromWatcher() would just signal another thread that the 
> notifyAll() needs to be called. This would unblock ZK's event thread so that 
> things can progress. I'll play around with this.
> 
> If anyone has a better idea or thoughts I'd appreciate it.
> 
> -Jordan
> 

Reply via email to