[ 
https://issues.apache.org/jira/browse/CURATOR-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15682042#comment-15682042
 ] 

ASF GitHub Bot commented on CURATOR-358:
----------------------------------------

GitHub user cammckenzie opened a pull request:

    https://github.com/apache/curator/pull/173

    CURATOR-358 - Fixed race condition with getLeader()

    - If leadership changes between the getParticipantNodes() call and the
    internal getLeader() call, the resulting NoNodeException is now handled
    and the next child in the list is evaluated.

    Another option would be to return the default empty Participant object
    immediately instead of iterating over the whole list of participants.
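
The approach described above can be sketched without Curator on the classpath. This is only an illustration, not the code in the pull request: `NoNodeException`, `NodeReader`, and `findLeader` are hypothetical stand-ins for ZooKeeper's exception, the znode read, and the internal lookup, and the empty string stands in for the default empty Participant.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LeaderLookupSketch {
    // Hypothetical stand-in for ZooKeeper's KeeperException.NoNodeException.
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("NoNode for " + path);
        }
    }

    // Hypothetical stand-in for reading a participant's id from its znode.
    interface NodeReader {
        String readId(String path) throws NoNodeException;
    }

    // Walks the sorted participant nodes and returns the id of the first one
    // that can still be read. A node that vanished between the listing and
    // the read is skipped. Returns "" (standing in for the default empty
    // Participant) if no node could be read.
    static String findLeader(List<String> sortedNodes, NodeReader reader) {
        for (String path : sortedNodes) {
            try {
                return reader.readId(path);
            } catch (NoNodeException e) {
                // The node was deleted after it was listed; try the next child.
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // Only the second listed node still exists when it is read.
        Map<String, String> live = new HashMap<>();
        live.put("/latch-0000000003", "host-b");

        NodeReader reader = path -> {
            String id = live.get(path);
            if (id == null) {
                throw new NoNodeException(path);
            }
            return id;
        };

        List<String> listed = Arrays.asList("/latch-0000000002", "/latch-0000000003");
        System.out.println(findLeader(listed, reader)); // prints "host-b"
    }
}
```

The alternative mentioned above would amount to returning the empty value from the catch block instead of continuing the loop.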

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/curator CURATOR-358

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/curator/pull/173.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #173
    
----
commit 3478aca7ed6852484b5574a6082f4bb75c04a1e0
Author: Cam McKenzie <[email protected]>
Date:   2016-11-20T23:38:15Z

    CURATOR-358 - Fixed race condition with getLeader()
    - If leadership changes between the getParticipantNodes() call and the
    internal getLeader() call, the resulting NoNodeException is now handled
    and the next child in the list is evaluated.

----


> Receiving KeeperException with NoNode when LeaderLatch#getLeader()
> ------------------------------------------------------------------
>
>                 Key: CURATOR-358
>                 URL: https://issues.apache.org/jira/browse/CURATOR-358
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.10.0
>            Reporter: Satish Duggana
>            Priority: Critical
>
> org.apache.curator.framework.recipes.leader.LeaderLatch#getLeader() 
> intermittently throws a KeeperException with Code#NONODE, as shown in the 
> stack trace below. A participant's ephemeral ZK node may have been removed 
> because its connection/session was closed. 
> The relevant code is at 
> https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L451
> public Participant getLeader() throws Exception
> {
>     Collection<String> participantNodes =
>         LockInternals.getParticipantNodes(client, latchPath, LOCK_NAME, sorter);
>     return LeaderSelector.getLeader(client, participantNodes);
> }
> This looks like a race condition: a participant node is listed, but by the 
> time LeaderSelector#getLeader() reads it, the node has already been removed 
> (for example, by a session timeout), so a KeeperException with the NoNode 
> code is thrown. The call is not retried because RetryLoop retries only on 
> connection/session timeouts, but in this case NoNode should have been 
> retried. I could not find any API on CuratorClient to configure which 
> KeeperException codes are retried. It may be good to have a way to specify 
> the retryable errors through the 
> org.apache.curator.framework.CuratorFrameworkFactory.Builder APIs. 
> Intermittent exception, with stack trace:
> 2016-11-15 06:09:33.954 o.a.s.d.nimbus [ERROR] Error when processing event
> org.apache.storm.shade.org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /storm/leader-lock/_c_97c09eed-5bba-4ac8-a05f-abdc4e8e95cf-latch-0000000002
>      at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
>      at org.apache.storm.shade.org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>      at org.apache.storm.shade.org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
>      at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:304)
>      at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:293)
>      at org.apache.storm.shade.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:108)
>      at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:290)
>      at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:281)
>      at org.apache.storm.shade.org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:42)
>      at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderSelector.participantForPath(LeaderSelector.java:375)
>      at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderSelector.getLeader(LeaderSelector.java:346)
>      at org.apache.storm.shade.org.apache.curator.framework.recipes.leader.LeaderLatch.getLeader(LeaderLatch.java:454)
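
The reporter's point that NoNode should be retried can be approximated on the caller side with a small wrapper. This is a minimal sketch, not a Curator API: `callWithRetry` here is a hypothetical helper (unrelated to Curator's `RetryLoop.callWithRetry`), and `IllegalStateException` stands in for the NoNode failure so the example runs with the JDK alone. With Curator on the classpath, the predicate would presumably test for `KeeperException.NoNodeException` and the operation would be `latch.getLeader()`.

```java
import java.util.concurrent.Callable;
import java.util.function.Predicate;

public class RetryOnSketch {
    // Hypothetical helper: runs op up to maxAttempts times, retrying only
    // when the thrown exception matches the retryable predicate. Any other
    // exception is rethrown immediately. No backoff is applied here.
    static <T> T callWithRetry(Callable<T> op, Predicate<Exception> retryable,
                               int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                if (!retryable.test(e)) {
                    throw e;
                }
                last = e; // remember the failure and try again
            }
        }
        throw last; // every attempt failed with a retryable exception
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // The first call simulates the race (node gone); the second succeeds.
        String leader = callWithRetry(() -> {
            if (calls[0]++ == 0) {
                throw new IllegalStateException("NoNode (simulated)");
            }
            return "host-b";
        }, e -> e instanceof IllegalStateException, 3);
        System.out.println(leader + " after " + calls[0] + " calls"); // prints "host-b after 2 calls"
    }
}
```

A second attempt re-lists the participants, so it normally sees the new leader. A real caller would likely add a short delay between attempts.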



