[
https://issues.apache.org/jira/browse/CURATOR-495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720824#comment-16720824
]
ASF GitHub Bot commented on CURATOR-495:
----------------------------------------
GitHub user Randgalt opened a pull request:
https://github.com/apache/curator/pull/297
[CURATOR-495] Fixes race in many Curator recipes which could block ZK's
event thread
Fixes race in many Curator recipes whereby a pattern was used that called
"notifyAll" in a synchronized block in response to a ZooKeeper watcher
callback. This created a race and possible deadlock if the recipe instance was
already in a synchronized block. This would result in the ZK event thread
getting blocked which would prevent ZK connections from getting repaired. This
change adds a new executor (available from CuratorFramework) that can be used
to do the sync/notify so that ZK's event thread is not blocked.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/curator CURATOR-495
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/curator/pull/297.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #297
----
commit f91adb22e83d8e47b99ad98f8c13c86251bc4cb3
Author: randgalt <randgalt@...>
Date: 2018-12-14T01:34:40Z
CURATOR-495
Fixes race in many Curator recipes whereby a pattern was used that called
"notifyAll" in a synchronized block in response to a ZooKeeper watcher
callback. This created a race and possible deadlock if the recipe instance was
already in a synchronized block. This would result in the ZK event thread
getting blocked which would prevent ZK connections from getting repaired. This
change adds a new executor (available from CuratorFramework) that can be used
to do the sync/notify so that ZK's event thread is not blocked.
----
> Race and possible dead locks with RetryPolicies and several Curator Recipes
> ---------------------------------------------------------------------------
>
> Key: CURATOR-495
> URL: https://issues.apache.org/jira/browse/CURATOR-495
> Project: Apache Curator
> Issue Type: Bug
> Components: Recipes
> Affects Versions: 4.0.1
> Reporter: Jordan Zimmerman
> Assignee: Jordan Zimmerman
> Priority: Blocker
> Fix For: 4.1.0
>
>
> In trying to figure out why {{TestInterProcessSemaphoreMutex}} is so flakey
> I've come across a fairly serious edge case in how several of our recipes
> work. You can see the issue in {{InterProcessSemaphoreV2}} (which is what
> {{InterProcessSemaphoreMutex}} uses internally). Look here:
> [InterProcessSemaphoreV2.java|https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/InterProcessSemaphoreV2.java#L373]
> The code synchronizes and then does {{client.getChildren()...}}. This is
> where the problem is. If there are connection problems inside of
> getChildren() the retry policy will do configured sleeping, retries, etc.
> Importantly, this is all done while the thread doing the retries holds
> InterProcessSemaphoreV2's monitor. If the ZK connection is repaired past the
> session timeout, ZK will eventually call InterProcessSemaphoreV2's watcher
> with an Expired message. InterProcessSemaphoreV2's watcher calls this method:
> {code}
> private synchronized void notifyFromWatcher()
> {
> notifyAll();
> }
> {code}
> You can see that this is a race. The thread doing "getChildren" is holding
> the monitor and is in a retry loop waiting for the connection to be repaired.
> However, ZK's event loop is trying to obtain that same monitor as a result of
> trying to call the synchronized notifyFromWatcher(). This means that the
> retry policy will always fail because ZK's event loop is tied up until that
> thread exists. Worse still, if someone were to use a retry policy of
> "RetryForever" they'd have a deadlock.
> This pattern is in about 10 files or so. I'm trying to think of a workaround.
> One possibility is to use a separate thread for this type of notification.
> i.e. notifyFromWatcher() would just signal another thread that the
> notifyAll() needs to be called. This would unblock ZK's event thread so that
> things can progress. I'll play around with this.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)