Nice one, thanks for sorting this out Jordan. I'll try and take a look at the PR in the next couple of days. cheers
On Sat, Dec 15, 2018 at 5:20 AM Jordan Zimmerman <[email protected]> wrote: > Wow - 3 successful runs in a row on Jenkins. That's something we haven't > seen with Curator for years. This is looking good. > > > On Dec 14, 2018, at 7:52 AM, Jordan Zimmerman < > [email protected]> wrote: > > > > Here's the PR for this: > > > > https://github.com/apache/curator/pull/297 > > > > I've run the tests a few times and it seems good. Another pair of eyes > or two would be appreciated. > > > > -Jordan > > > >> On Dec 13, 2018, at 2:09 PM, Jordan Zimmerman < > [email protected]> wrote: > >> > >> FYI > >> > >> I was trying to figure out why TestInterProcessSemaphoreMutex is so > flakey. In debugging the problem I've come across a fairly serious edge > case in how several of our recipes work. You can see the issue in > InterProcessSemaphoreV2 (which is what InterProcessSemaphoreMutex uses > internally). Have a look here: > >> > >> > https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/InterProcessSemaphoreV2.java#L373 > >> > >> The code synchronizes and then does "client.getChildren()...". This is > where the problem is. If there are connection problems inside of > getChildren() the retry policy will do configured sleeping, retries, etc. > Importantly, this is all done while the thread doing the retires holds > InterProcessSemaphoreV2's monitor. If the ZK connection is repaired past > the session timeout, ZK will eventually call InterProcessSemaphoreV2's > watcher with an Expired message. InterProcessSemaphoreV2's watcher calls > this method: > >> > >> private synchronized void notifyFromWatcher() > >> { > >> notifyAll(); > >> } > >> > >> You can see that this is a race. The thread doing "getChildren" is > holding the monitor and is in a retry loop waiting for the connection to be > repaired. However, ZK's event loop is trying to obtain that same monitor as > a result of trying to call the synchronized notifyFromWatcher(). This means > that the retry policy will always fail because ZK's event loop is tied up > until that thread exists. Worse still, if someone were to use a retry > policy of "RetryForever" they'd have a deadlock. > >> > >> This pattern is in about 10 files or so. I'm trying to think of a > workaround. One possibility is to use a separate thread for this type of > notification. i.e. notifyFromWatcher() would just signal another thread > that the notifyAll() needs to be called. This would unblock ZK's event > thread so that things can progress. I'll play around with this. > >> > >> If anyone has a better idea or thoughts I'd appreciate it. > >> > >> -Jordan > >> > > > >
