[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133639#comment-16133639
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2471:
-------------------------------------------

Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/330
  
    The `now` variable was there when this project starts, and `updateNow` was 
introduced in ZOOKEEPER-909 which was just a refactoring so it's irrelevant to 
the core logic. I had a chat with @phunt who wrote the original code and the 
`now` was not introduced by accident - according to Pat it's used to solve two 
problems:
    * A performance optimization by caching as on some platforms getting 
current time is not cheap.
    * On some platforms get current time will go backwards so to prevent that a 
cached time was used at the very start.
    Now after 10 years and many changes (like using monotonic clock) the 
original design might become irrelevant, and it might make sense to instead 
calculate the now time when needed. Though, I'd like to have another look into 
this issue before making a conclusion.
    
    One idea is to have a test case that shows the bug (client infinity connect 
loop etc), and then verify that the bug gets fixed with the patch. This will be 
more convincing. I understand writing such test takes more effort than the 
patch itself, but eventually it's something must to have for such a change.
    
    On a side note - this patch is doing fine with my stress test and does not 
cause any regressions.
    



> Java Zookeeper Client incorrectly considers time spent sleeping as time spent 
> connecting, potentially resulting in infinite reconnect loop
> ------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2471
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2471
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: java client
>    Affects Versions: 3.5.3
>         Environment: all
>            Reporter: Dan Benediktson
>            Assignee: Dan Benediktson
>         Attachments: ZOOKEEPER-2471.patch
>
>
> ClientCnxnSocket uses a member variable "now" to track the current time, and 
> lastSend / lastHeard variables to track socket liveness. Implementations, and 
> even ClientCnxn itself, are expected to call both updateNow() to reset "now" 
> to System.currentTimeMillis, and then call updateLastSend()/updateLastHeard() 
> on IO completions.
> This is a fragile contract, so it's not surprising that there's a bug 
> resulting from it: ClientCnxn.SendThread.run() calls updateLastSendAndHeard() 
> as soon as startConnect() returns, but it does not call updateNow() first. I 
> expect when this was written, either the expectation was that startConnect() 
> was an asynchronous operation and that updateNow() would have been called 
> very recently, or simply the requirement to call updateNow() was forgotten at 
> this point. As far as I can see, this bug has been present since the 
> "updateNow" method was first introduced in the distant past. As it turns out, 
> since startConnect() calls HostProvider.next(), which can sleep, quite a lot 
> of time can pass, leaving a big gap between "now" and now.
> If you are using very short session timeouts (one of our ZK ensembles has 
> many clients using a 1-second timeout), this is potentially disastrous, 
> because the sleep time may exceed the connection timeout itself, which can 
> potentially result in the Java client being stuck in a perpetual reconnect 
> loop. The exact code path it goes through in this case is complicated, 
> because there has to be a previously-closed socket still waiting in the 
> selector (otherwise, the first timeout evaluation will not fail because "now" 
> still hasn't been updated, and then the actual connect timeout will be 
> applied in ClientCnxnSocket.doTransport()) so that select() will harvest the 
> IO from the previous socket and updateNow(), resulting in the next loop 
> through ClientCnxnSocket.SendThread.run() observing the spurious timeout and 
> failing. In practice it does happen to us fairly frequently; we only got to 
> the bottom of the bug yesterday. Worse, when it does happen, the Zookeeper 
> client object is rendered unusable: it's stuck in a perpetual reconnect loop 
> where it keeps sleeping, opening a socket, and immediately closing it.
> I have a patch. Rather than calling updateNow() right after startConnect(), 
> my fix is to remove the "now" member variable and the updateNow() method 
> entirely, and to instead just call System.currentTimeMillis() whenever time 
> needs to be evaluated. I realize there is a benefit (aside from a trivial 
> micro-optimization not worth worrying about) to having the time be "fixed", 
> particularly for truth in the logging: if time is fixed by an updateNow() 
> call, then the log for a timeout will still show exactly the same value the 
> code reasoned about. However, this benefit is in my opinion not enough to 
> merit the fragility of the contract which led to this (for us) highly 
> impactful and difficult-to-find bug in the first place.
> I'm currently running ant tests locally against my patch on trunk, and then 
> I'll upload it here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to