[ https://issues.apache.org/jira/browse/ZOOKEEPER-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120512#comment-16120512 ]
Hadoop QA commented on ZOOKEEPER-2471: -------------------------------------- +1 overall. GitHub Pull Request Build +1 @author. The patch does not contain any @author tags. +0 tests included. The patch appears to be a documentation patch that doesn't require tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/932//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/932//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/932//console This message is automatically generated. > Java Zookeeper Client incorrectly considers time spent sleeping as time spent > connecting, potentially resulting in infinite reconnect loop > ------------------------------------------------------------------------------------------------------------------------------------------ > > Key: ZOOKEEPER-2471 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2471 > Project: ZooKeeper > Issue Type: Bug > Components: java client > Affects Versions: 3.5.3 > Environment: all > Reporter: Dan Benediktson > Assignee: Dan Benediktson > Attachments: ZOOKEEPER-2471.patch > > > ClientCnxnSocket uses a member variable "now" to track the current time, and > lastSend / lastHeard variables to track socket liveness. Implementations, and > even ClientCnxn itself, are expected to call both updateNow() to reset "now" > to System.currentTimeMillis, and then call updateLastSend()/updateLastHeard() > on IO completions. > This is a fragile contract, so it's not surprising that there's a bug > resulting from it: ClientCnxn.SendThread.run() calls updateLastSendAndHeard() > as soon as startConnect() returns, but it does not call updateNow() first. I > expect when this was written, either the expectation was that startConnect() > was an asynchronous operation and that updateNow() would have been called > very recently, or simply the requirement to call updateNow() was forgotten at > this point. As far as I can see, this bug has been present since the > "updateNow" method was first introduced in the distant past. As it turns out, > since startConnect() calls HostProvider.next(), which can sleep, quite a lot > of time can pass, leaving a big gap between "now" and now. > If you are using very short session timeouts (one of our ZK ensembles has > many clients using a 1-second timeout), this is potentially disastrous, > because the sleep time may exceed the connection timeout itself, which can > potentially result in the Java client being stuck in a perpetual reconnect > loop. The exact code path it goes through in this case is complicated, > because there has to be a previously-closed socket still waiting in the > selector (otherwise, the first timeout evaluation will not fail because "now" > still hasn't been updated, and then the actual connect timeout will be > applied in ClientCnxnSocket.doTransport()) so that select() will harvest the > IO from the previous socket and updateNow(), resulting in the next loop > through ClientCnxnSocket.SendThread.run() observing the spurious timeout and > failing. In practice it does happen to us fairly frequently; we only got to > the bottom of the bug yesterday. Worse, when it does happen, the Zookeeper > client object is rendered unusable: it's stuck in a perpetual reconnect loop > where it keeps sleeping, opening a socket, and immediately closing it. > I have a patch. Rather than calling updateNow() right after startConnect(), > my fix is to remove the "now" member variable and the updateNow() method > entirely, and to instead just call System.currentTimeMillis() whenever time > needs to be evaluated. I realize there is a benefit (aside from a trivial > micro-optimization not worth worrying about) to having the time be "fixed", > particularly for truth in the logging: if time is fixed by an updateNow() > call, then the log for a timeout will still show exactly the same value the > code reasoned about. However, this benefit is in my opinion not enough to > merit the fragility of the contract which led to this (for us) highly > impactful and difficult-to-find bug in the first place. > I'm currently running ant tests locally against my patch on trunk, and then > I'll upload it here. -- This message was sent by Atlassian JIRA (v6.4.14#64029)