FYI
I was trying to figure out why TestInterProcessSemaphoreMutex is so flakey. In
debugging the problem I've come across a fairly serious edge case in how
several of our recipes work. You can see the issue in InterProcessSemaphoreV2
(which is what InterProcessSemaphoreMutex uses internally). Have a look here:
https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/InterProcessSemaphoreV2.java#L373
The code synchronizes and then does "client.getChildren()...". This is where
the problem is. If there are connection problems inside of getChildren() the
retry policy will do configured sleeping, retries, etc. Importantly, this is
all done while the thread doing the retires holds InterProcessSemaphoreV2's
monitor. If the ZK connection is repaired past the session timeout, ZK will
eventually call InterProcessSemaphoreV2's watcher with an Expired message.
InterProcessSemaphoreV2's watcher calls this method:
private synchronized void notifyFromWatcher()
{
notifyAll();
}
You can see that this is a race. The thread doing "getChildren" is holding the
monitor and is in a retry loop waiting for the connection to be repaired.
However, ZK's event loop is trying to obtain that same monitor as a result of
trying to call the synchronized notifyFromWatcher(). This means that the retry
policy will always fail because ZK's event loop is tied up until that thread
exists. Worse still, if someone were to use a retry policy of "RetryForever"
they'd have a deadlock.
This pattern is in about 10 files or so. I'm trying to think of a workaround.
One possibility is to use a separate thread for this type of notification. i.e.
notifyFromWatcher() would just signal another thread that the notifyAll() needs
to be called. This would unblock ZK's event thread so that things can progress.
I'll play around with this.
If anyone has a better idea or thoughts I'd appreciate it.
-Jordan