[
https://issues.apache.org/jira/browse/CURATOR-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jordan Zimmerman resolved CURATOR-209.
--------------------------------------
Resolution: Fixed
> Background retry falls into infinite loop of reconnection after connection loss
> -------------------------------------------------------------------------------
>
> Key: CURATOR-209
> URL: https://issues.apache.org/jira/browse/CURATOR-209
> Project: Apache Curator
> Issue Type: Bug
> Components: Framework
> Affects Versions: 2.6.0
> Environment: sun java jdk 1.7.0_55, curator 2.6.0, zookeeper 3.3.6 on
> AWS EC2 in a 3 box ensemble
> Reporter: Ryan Anderson
> Priority: Critical
> Labels: connectionloss, loop, reconnect
> Fix For: 2.9.2
>
>
> We've been unable to replicate this in our test environments, but
> approximately once a week in production (a ~50 machine cluster using
> Curator/ZK for service discovery) one machine falls into a loop and spews
> tens of thousands of errors that look like:
> {code}
> Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>     at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na]
>     at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:496) [curator-framework-2.6.0.jar:na]
>     at org.apache.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:538) [curator-framework-2.6.0.jar:na]
>     at org.apache.curator.framework.imps.CreateBuilderImpl.access$700(CreateBuilderImpl.java:44) [curator-framework-2.6.0.jar:na]
>     at org.apache.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:497) [curator-framework-2.6.0.jar:na]
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) [zookeeper-3.4.6.jar:3.4.6-1569965]
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) [zookeeper-3.4.6.jar:3.4.6-1569965]
> {code}
> The rate at which we get these errors seems to increase linearly until we
> stop the process: it starts at 10-20/sec, and by the time we kill the box it
> is typically generating 1,000+/sec.
> When the error first occurs, there's a slightly different stack trace:
> {code}
> Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>     at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na]
>     at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:813) [curator-framework-2.6.0.jar:na]
>     at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
>     at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
>     at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
>     at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
> followed very closely by:
> {code}
> Background retry gave up
> org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
>     at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:796) [curator-framework-2.6.0.jar:na]
>     at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
>     at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
>     at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
>     at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
> After that, it begins spewing the stack trace I first posted above. We
> assume that some momentary networking hiccup in EC2 is causing the
> ConnectionLoss: none of our other boxes see it, and when we check the
> affected box it can connect to all the ZooKeeper servers without any issue.
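> Until the fix lands, the mitigation we're looking at is watching for the
> LOST connection state via Curator's public listener API and recycling the
> client instead of letting background retries loop. A rough sketch (the
> class name, connect string, and retry timings below are placeholders, not
> from our actual setup):
> {code}
> import org.apache.curator.framework.CuratorFramework;
> import org.apache.curator.framework.CuratorFrameworkFactory;
> import org.apache.curator.framework.state.ConnectionState;
> import org.apache.curator.framework.state.ConnectionStateListener;
> import org.apache.curator.retry.ExponentialBackoffRetry;
>
> public class LostSessionWatchdog {
>     public static CuratorFramework start(String connectString) {
>         // Bounded retry policy: base sleep 1s, at most 3 retries per operation.
>         CuratorFramework client = CuratorFrameworkFactory.newClient(
>                 connectString, new ExponentialBackoffRetry(1000, 3));
>         client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
>             @Override
>             public void stateChanged(CuratorFramework c, ConnectionState newState) {
>                 if (newState == ConnectionState.LOST) {
>                     // Session is gone: close this client rather than letting
>                     // background operations retry forever, then recreate it
>                     // and re-register any ephemeral nodes elsewhere.
>                     c.close();
>                 }
>             }
>         });
>         client.start();
>         return client;
>     }
> }
> {code}
> This is only a sketch against the Curator 2.x API; in real code the close
> and rebuild should happen off the listener thread.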
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)