[jira] [Commented] (CURATOR-495) Race and possible dead locks with RetryPolicies and several Curator Recipes

ASF GitHub Bot (JIRA) Sun, 16 Dec 2018 16:57:16 -0800


    [ 
https://issues.apache.org/jira/browse/CURATOR-495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16722627#comment-16722627
 ]


ASF GitHub Bot commented on CURATOR-495:
----------------------------------------

Github user cammckenzie commented on a diff in the pull request:

    https://github.com/apache/curator/pull/297#discussion_r242005321
  
    --- Diff: 
curator-framework/src/main/java/org/apache/curator/framework/CuratorFrameworkFactory.java
 ---
    @@ -474,6 +477,28 @@ public Builder schemaSet(SchemaSet schemaSet)
                 return this;
             }
     
    +        /**
    +         * Curator (and user) recipes will use this executor to call 
notifyAll
    +         * and other blocking calls that might normally block ZooKeeper's 
event thread.
    +         * By default, an executor is allocated internally using the 
provided (or default)
    +         * {@link #threadFactory(java.util.concurrent.ThreadFactory)}. Use 
this method
    +         * to set a custom executor.
    +         *
    +         * @param runSafeService executor to use for calls to notifyAll 
from Watcher callbacks etc
    +         * @return this
    +         * @since 4.1.0
    +         */
    +        public Builder runSafeService(Executor runSafeService)
    +        {
    +            this.runSafeService = runSafeService;
    +            return null;
    --- End diff --
    
    This should return this rather than null


> Race and possible dead locks with RetryPolicies and several Curator Recipes
> ---------------------------------------------------------------------------
>
>                 Key: CURATOR-495
>                 URL: https://issues.apache.org/jira/browse/CURATOR-495
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 4.0.1
>            Reporter: Jordan Zimmerman
>            Assignee: Jordan Zimmerman
>            Priority: Blocker
>             Fix For: 4.1.0
>
>
> In trying to figure out why {{TestInterProcessSemaphoreMutex}} is so flakey 
> I've come across a fairly serious edge case in how several of our recipes 
> work. You can see the issue in {{InterProcessSemaphoreV2}} (which is what 
> {{InterProcessSemaphoreMutex}} uses internally). Look here:
> [InterProcessSemaphoreV2.java|https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/InterProcessSemaphoreV2.java#L373]
> The code synchronizes and then does {{client.getChildren()...}}. This is 
> where the problem is. If there are connection problems inside of 
> getChildren() the retry policy will do configured sleeping, retries, etc. 
> Importantly, this is all done while the thread doing the retries holds 
> InterProcessSemaphoreV2's monitor. If the ZK connection is repaired past the 
> session timeout, ZK will eventually call InterProcessSemaphoreV2's watcher 
> with an Expired message. InterProcessSemaphoreV2's watcher calls this method:
> {code}
> private synchronized void notifyFromWatcher()
> {
>     notifyAll();
> }
> {code}
> You can see that this is a race. The thread doing "getChildren" is holding 
> the monitor and is in a retry loop waiting for the connection to be repaired. 
> However, ZK's event loop is trying to obtain that same monitor as a result of 
> trying to call the synchronized notifyFromWatcher(). This means that the 
> retry policy will always fail because ZK's event loop is tied up until that 
> thread exists. Worse still, if someone were to use a retry policy of 
> "RetryForever" they'd have a deadlock.
> This pattern is in about 10 files or so. I'm trying to think of a workaround. 
> One possibility is to use a separate thread for this type of notification. 
> i.e. notifyFromWatcher() would just signal another thread that the 
> notifyAll() needs to be called. This would unblock ZK's event thread so that 
> things can progress. I'll play around with this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (CURATOR-495) Race and possible dead locks with RetryPolicies and several Curator Recipes

Reply via email to