[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120297#comment-16120297
 ] 

Hadoop QA commented on ZOOKEEPER-2471:
--------------------------------------

-1 overall.  GitHub Pull Request  Build
      

    +1 @author.  The patch does not contain any @author tags.

    +0 tests included.  The patch appears to be a documentation patch that 
doesn't require tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

    +1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/929//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/929//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/929//console

This message is automatically generated.

> Java Zookeeper Client incorrectly considers time spent sleeping as time spent 
> connecting, potentially resulting in infinite reconnect loop
> ------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2471
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2471
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: java client
>    Affects Versions: 3.5.1
>         Environment: all
>            Reporter: Dan Benediktson
>            Assignee: Dan Benediktson
>         Attachments: ZOOKEEPER-2471.patch
>
>
> ClientCnxnSocket uses a member variable "now" to track the current time, and 
> lastSend / lastHeard variables to track socket liveness. Implementations, and 
> even ClientCnxn itself, are expected to call both updateNow() to reset "now" 
> to System.currentTimeMillis, and then call updateLastSend()/updateLastHeard() 
> on IO completions.
> This is a fragile contract, so it's not surprising that there's a bug 
> resulting from it: ClientCnxn.SendThread.run() calls updateLastSendAndHeard() 
> as soon as startConnect() returns, but it does not call updateNow() first. I 
> expect when this was written, either the expectation was that startConnect() 
> was an asynchronous operation and that updateNow() would have been called 
> very recently, or simply the requirement to call updateNow() was forgotten at 
> this point. As far as I can see, this bug has been present since the 
> "updateNow" method was first introduced in the distant past. As it turns out, 
> since startConnect() calls HostProvider.next(), which can sleep, quite a lot 
> of time can pass, leaving a big gap between "now" and now.
> If you are using very short session timeouts (one of our ZK ensembles has 
> many clients using a 1-second timeout), this is potentially disastrous, 
> because the sleep time may exceed the connection timeout itself, which can 
> potentially result in the Java client being stuck in a perpetual reconnect 
> loop. The exact code path it goes through in this case is complicated, 
> because there has to be a previously-closed socket still waiting in the 
> selector (otherwise, the first timeout evaluation will not fail because "now" 
> still hasn't been updated, and then the actual connect timeout will be 
> applied in ClientCnxnSocket.doTransport()) so that select() will harvest the 
> IO from the previous socket and updateNow(), resulting in the next loop 
> through ClientCnxnSocket.SendThread.run() observing the spurious timeout and 
> failing. In practice it does happen to us fairly frequently; we only got to 
> the bottom of the bug yesterday. Worse, when it does happen, the Zookeeper 
> client object is rendered unusable: it's stuck in a perpetual reconnect loop 
> where it keeps sleeping, opening a socket, and immediately closing it.
> I have a patch. Rather than calling updateNow() right after startConnect(), 
> my fix is to remove the "now" member variable and the updateNow() method 
> entirely, and to instead just call System.currentTimeMillis() whenever time 
> needs to be evaluated. I realize there is a benefit (aside from a trivial 
> micro-optimization not worth worrying about) to having the time be "fixed", 
> particularly for truth in the logging: if time is fixed by an updateNow() 
> call, then the log for a timeout will still show exactly the same value the 
> code reasoned about. However, this benefit is in my opinion not enough to 
> merit the fragility of the contract which led to this (for us) highly 
> impactful and difficult-to-find bug in the first place.
> I'm currently running ant tests locally against my patch on trunk, and then 
> I'll upload it here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to