[ https://issues.apache.org/jira/browse/CURATOR-15?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13692407#comment-13692407 ]

Julio Lopez commented on CURATOR-15:
------------------------------------

Here is an occurrence, caused by an UnknownHostException. Perhaps either 
LeaderSelector or InterProcessMutex should handle such cases and retry.

{{
E 06-23 03:30:07.106 LeaderSelector-0 c.n.c.f.r.l.LeaderSelector:349 |::] mutex.acquire() threw an exception
java.net.UnknownHostException: xyz.example.com
        at java.net.InetAddress.getAllByName0(Unknown Source) ~[...]
        at java.net.InetAddress.getAllByName(Unknown Source) ~[...]
        at java.net.InetAddress.getAllByName(Unknown Source) ~[...]
        at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:60) ~[...]
        at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445) ~[...]
        at com.netflix.curator.utils.DefaultZookeeperFactory.newZooKeeper(DefaultZookeeperFactory.java:27) ~[...]
        at com.netflix.curator.framework.imps.CuratorFrameworkImpl$2.newZooKeeper(CuratorFrameworkImpl.java:166) ~[...]
        at com.netflix.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:94) ~[...]
        at com.netflix.curator.HandleHolder.getZooKeeper(HandleHolder.java:55) ~[...]
        at com.netflix.curator.ConnectionState.getZooKeeper(ConnectionState.java:112) ~[...]
        at com.netflix.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:107) ~[...]
        at com.netflix.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:448) ~[...]
        at com.netflix.curator.framework.imps.CreateBuilderImpl$10.call(CreateBuilderImpl.java:625) ~[...]
        at com.netflix.curator.framework.imps.CreateBuilderImpl$10.call(CreateBuilderImpl.java:609) ~[...]
        at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:106) ~[...]
        at com.netflix.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:605) ~[...]
        at com.netflix.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:428) ~[...]
        at com.netflix.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:41) ~[...]
        at com.netflix.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:218) ~[...]
        at com.netflix.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:218) ~[...]
        at com.netflix.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:74) ~[...]
        at com.netflix.curator.framework.recipes.leader.LeaderSelector.doWork(LeaderSelector.java:313) [...]
        at com.netflix.curator.framework.recipes.leader.LeaderSelector.doWorkLoop(LeaderSelector.java:374) [...]
        at com.netflix.curator.framework.recipes.leader.LeaderSelector.access$100(LeaderSelector.java:45) [...]
        at com.netflix.curator.framework.recipes.leader.LeaderSelector$2.call(LeaderSelector.java:194) [...]
        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) [na:1.6.0_32]
        at java.util.concurrent.FutureTask.run(Unknown Source) [na:1.6.0_32]
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) [na:1.6.0_32]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [na:1.6.0_32]
}}
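
The retry suggested in the comment could look roughly like the sketch below. It is plain JDK code, not Curator's implementation: the class AcquireRetrySketch, the helper callWithBoundedRetry, and the attempt and sleep parameters are all hypothetical names chosen for illustration, and the Callable stands in for a call such as mutex.acquire().

```java
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

public class AcquireRetrySketch {
    /**
     * Hypothetical helper: calls the action until it succeeds or maxAttempts
     * is reached. Only UnknownHostException is retried; the last failure is
     * rethrown once the attempts are used up.
     */
    public static <T> T callWithBoundedRetry(Callable<T> action, int maxAttempts, long sleepMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (UnknownHostException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMs); // fixed pause between attempts
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Stand-in for mutex.acquire(): fails twice with a DNS error, then succeeds.
        String result = callWithBoundedRetry(() -> {
            if (++calls[0] < 3) {
                throw new UnknownHostException("xyz.example.com");
            }
            return "acquired";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "acquired after 3 attempts"
    }
}
```

Bounding the attempts keeps a permanently unresolvable host from looping forever; what to do after the final failure is the separate question raised in the issue description.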
                
> LeaderSelector may (undetectably) fail to elect
> -----------------------------------------------
>
>                 Key: CURATOR-15
>                 URL: https://issues.apache.org/jira/browse/CURATOR-15
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.0.0-incubating
>            Reporter: Shevek
>             Fix For: TBD
>
>
> In LeaderSelector, if mutex.acquire() throws an Exception, for example 
> because CuratorFramework.getZooKeeper() threw a previously-enqueued 
> background exception, then that failure will propagate out of doWork and 
> doWorkLoop, and kill the background submission onto the executor service.
> This means that a LeaderSelector which was start()ed will NEVER elect, and 
> this situation is NOT DETECTABLE externally, since that exception happens on 
> a private ExecutorService thread and is not client visible. It's impossible 
> to look at a LeaderSelector and decide whether it is still "viable".
> This can leave a machine/process "hung" and not automatically recoverable 
> within Curator.
> Either isQueued() needs to be exposed, which means that a leader is either 
> elected or queued; or the finally{} block which calls clearIsQueued() needs 
> also to set state to CLOSED or FAILED, so that we can query this failure 
> externally.
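
The failure mode and the second proposed fix can be illustrated with a minimal, JDK-only sketch. It does not use Curator: ObservableSelectorSketch, its State enum (including FAILED), the Work interface, and shutdownAndWait() are hypothetical names, and the submitted work stands in for doWorkLoop(). An exception thrown inside a task passed to ExecutorService.submit() is captured in the returned Future, so nothing is reported unless someone calls get(). Recording the failure in a finally block makes it queryable from outside.

```java
import java.net.UnknownHostException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class ObservableSelectorSketch {
    // Hypothetical states; FAILED is the externally visible marker the report asks for.
    public enum State { LATENT, STARTED, FAILED }

    // Stand-in for the work loop, which may throw from mutex.acquire().
    public interface Work { void run() throws Exception; }

    private final AtomicReference<State> state = new AtomicReference<>(State.LATENT);
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    public Future<Void> start(Work work) {
        state.set(State.STARTED);
        return executor.submit(() -> {
            boolean completed = false;
            try {
                work.run();
                completed = true;
                return null;
            } finally {
                // Without this, the exception would live only inside the Future.
                if (!completed) {
                    state.set(State.FAILED);
                }
            }
        });
    }

    public State getState() {
        return state.get();
    }

    // Helper for the demo: stop accepting work and wait for the task to finish.
    public void shutdownAndWait() throws InterruptedException {
        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws Exception {
        ObservableSelectorSketch selector = new ObservableSelectorSketch();
        // The Future is deliberately ignored, as a private background task's would be.
        selector.start(() -> { throw new UnknownHostException("xyz.example.com"); });
        selector.shutdownAndWait();
        System.out.println(selector.getState()); // prints FAILED
    }
}
```

A caller polling getState() can then notice the dead selector and restart it or alert, which is the external visibility the description says is missing.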
