Nice one, thanks for sorting this out Jordan. I'll try and take a look at
the PR in the next couple of days.
cheers

On Sat, Dec 15, 2018 at 5:20 AM Jordan Zimmerman <[email protected]>
wrote:

> Wow - 3 successful runs in a row on Jenkins. That's something we haven't
> seen with Curator for years. This is looking good.
>
> > On Dec 14, 2018, at 7:52 AM, Jordan Zimmerman <
> [email protected]> wrote:
> >
> > Here's the PR for this:
> >
> >       https://github.com/apache/curator/pull/297
> >
> > I've run the tests a few times and it seems good. Another pair of eyes
> or two would be appreciated.
> >
> > -Jordan
> >
> >> On Dec 13, 2018, at 2:09 PM, Jordan Zimmerman <
> [email protected]> wrote:
> >>
> >> FYI
> >>
> >> I was trying to figure out why TestInterProcessSemaphoreMutex is so
> flakey. In debugging the problem I've come across a fairly serious edge
> case in how several of our recipes work. You can see the issue in
> InterProcessSemaphoreV2 (which is what InterProcessSemaphoreMutex uses
> internally). Have a look here:
> >>
> >>
> https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/InterProcessSemaphoreV2.java#L373
> >>
> >> The code synchronizes and then does "client.getChildren()...". This is
> where the problem is. If there are connection problems inside of
> getChildren() the retry policy will do configured sleeping, retries, etc.
> Importantly, this is all done while the thread doing the retires holds
> InterProcessSemaphoreV2's monitor. If the ZK connection is repaired past
> the session timeout, ZK will eventually call InterProcessSemaphoreV2's
> watcher with an Expired message. InterProcessSemaphoreV2's watcher calls
> this method:
> >>
> >>      private synchronized void notifyFromWatcher()
> >>      {
> >>          notifyAll();
> >>      }
> >>
> >> You can see that this is a race. The thread doing "getChildren" is
> holding the monitor and is in a retry loop waiting for the connection to be
> repaired. However, ZK's event loop is trying to obtain that same monitor as
> a result of trying to call the synchronized notifyFromWatcher(). This means
> that the retry policy will always fail because ZK's event loop is tied up
> until that thread exists. Worse still, if someone were to use a retry
> policy of "RetryForever" they'd have a deadlock.
> >>
> >> This pattern is in about 10 files or so. I'm trying to think of a
> workaround. One possibility is to use a separate thread for this type of
> notification. i.e. notifyFromWatcher() would just signal another thread
> that the notifyAll() needs to be called. This would unblock ZK's event
> thread so that things can progress. I'll play around with this.
> >>
> >> If anyone has a better idea or thoughts I'd appreciate it.
> >>
> >> -Jordan
> >>
> >
>
>

Reply via email to