[
https://issues.apache.org/jira/browse/ZOOKEEPER-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133300#comment-16133300
]
ASF GitHub Bot commented on ZOOKEEPER-2471:
-------------------------------------------
Github user nicktrav commented on the issue:
https://github.com/apache/zookeeper/pull/330
Feels a little awkward to have a non-trivial change without any tests :/
> I'm still going to be stuck depending on the behavior of that
randomly-generated sleep time, which is going to be inherently unreliable
Thinking aloud here ... given your second commit updates to use the `Time`
utility, could you now inject that class, (or perhaps a faked implementation
given the methods are static), into `ClientCnxnSocket` and its subclasses? This
would make it more amenable to testing. You'd have more control over the times
that are returned.
You'd still need to plumb that through into `ClientCnxn`, but it looks like
the constructor takes a `ClientCnxnSocket`, which could work here?
Maybe we can get that scaffolding wired up in another PR, get that merged
in, and this change could leverage that?
> Java Zookeeper Client incorrectly considers time spent sleeping as time spent
> connecting, potentially resulting in infinite reconnect loop
> ------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-2471
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2471
> Project: ZooKeeper
> Issue Type: Bug
> Components: java client
> Affects Versions: 3.5.3
> Environment: all
> Reporter: Dan Benediktson
> Assignee: Dan Benediktson
> Attachments: ZOOKEEPER-2471.patch
>
>
> ClientCnxnSocket uses a member variable "now" to track the current time, and
> lastSend / lastHeard variables to track socket liveness. Implementations, and
> even ClientCnxn itself, are expected to call both updateNow() to reset "now"
> to System.currentTimeMillis, and then call updateLastSend()/updateLastHeard()
> on IO completions.
> This is a fragile contract, so it's not surprising that there's a bug
> resulting from it: ClientCnxn.SendThread.run() calls updateLastSendAndHeard()
> as soon as startConnect() returns, but it does not call updateNow() first. I
> expect when this was written, either the expectation was that startConnect()
> was an asynchronous operation and that updateNow() would have been called
> very recently, or simply the requirement to call updateNow() was forgotten at
> this point. As far as I can see, this bug has been present since the
> "updateNow" method was first introduced in the distant past. As it turns out,
> since startConnect() calls HostProvider.next(), which can sleep, quite a lot
> of time can pass, leaving a big gap between "now" and now.
> If you are using very short session timeouts (one of our ZK ensembles has
> many clients using a 1-second timeout), this is potentially disastrous,
> because the sleep time may exceed the connection timeout itself, which can
> potentially result in the Java client being stuck in a perpetual reconnect
> loop. The exact code path it goes through in this case is complicated,
> because there has to be a previously-closed socket still waiting in the
> selector (otherwise, the first timeout evaluation will not fail because "now"
> still hasn't been updated, and then the actual connect timeout will be
> applied in ClientCnxnSocket.doTransport()) so that select() will harvest the
> IO from the previous socket and updateNow(), resulting in the next loop
> through ClientCnxnSocket.SendThread.run() observing the spurious timeout and
> failing. In practice it does happen to us fairly frequently; we only got to
> the bottom of the bug yesterday. Worse, when it does happen, the Zookeeper
> client object is rendered unusable: it's stuck in a perpetual reconnect loop
> where it keeps sleeping, opening a socket, and immediately closing it.
> I have a patch. Rather than calling updateNow() right after startConnect(),
> my fix is to remove the "now" member variable and the updateNow() method
> entirely, and to instead just call System.currentTimeMillis() whenever time
> needs to be evaluated. I realize there is a benefit (aside from a trivial
> micro-optimization not worth worrying about) to having the time be "fixed",
> particularly for truth in the logging: if time is fixed by an updateNow()
> call, then the log for a timeout will still show exactly the same value the
> code reasoned about. However, this benefit is in my opinion not enough to
> merit the fragility of the contract which led to this (for us) highly
> impactful and difficult-to-find bug in the first place.
> I'm currently running ant tests locally against my patch on trunk, and then
> I'll upload it here.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)