Thank you Adar and Andrew for the thoughtful feedback. I think switching to the 'system_unsync' clock for non-EMC (i.e. not based on External Mini-Cluster) tests is safe enough, even if it might be affected by NTP and manual time changes.
I guess the majority of Kudu tests which are now run with 'system' clock source would not even notice if time travels back and/or forth during their run. Those that notice are tests which should be set apart into a separate time-sensitive category and run only a. using 'system' time source when machine clock is synchronized with NTP b. using the built-in NTP client when it's synchronized with a reliable NTP server So far I like the idea of switching non-EMC tests to 'system_unsync' time source and keeping EMC tests on the 'builtin' source. Meanwhile, I can find out which tests are in the time-sensitive category: it would be an environment where system time is jumping back and forth while tests are running. Please let me know what you think about this. Thanks, Alexey On Mon, Oct 7, 2019 at 6:46 PM Andrew Wong <[email protected]> wrote: > I had a chat with Alexey about this that I think helped clarify some things > for > me, so thought I'd share (though feel free to correct me if I'm wrong about > anything): > - The current state of the master branch is that: > - Our external mini cluster tests initialize MiniChronyd, and because of > this, when used with the builtin client, the clock is considered > synchronized almost immediately. > - All our other tests don't do this. They continue to use the system > clock, > and thus, may be susceptible to the "slow NTP" startup issues we've > seen > before. > - This thread is suggesting that we converge on using a single time piece: > the `system_unsync` aka "local clock", which may not necessarily be > synchronized by the Kernel with time servers, but would be good enough > for > all of our existing tests. This seems desirable because it: > - Means that all our tests are run in the same way using the same clock > semantics. > - Should be sufficiently fast to "synchronize" because there's no actual > NTP > synchronization involved. And that's OK. Our tests don't care. > > I agree with this approach, but I also wouldn't be opposed to keeping our > existing external mini cluster tests using the builtin client (and using > the > local clock for other tests), because: > - It is closer to what we might expect in the real world because we expect > to > eventually default to using the builtin client. While not truly "real", > since > we're still synchronizing with something local and not an external set of > time servers, it still seems like better coverage than none. > - It should still net us fewer NTP-related "random" test failures. > - The line between tests that use external mini clusters and tests that > don't > is very clear. And so the argument that it might be confusing to have > different time semantics across tests isn't as compelling to me. > > I don't feel particularly strongly for this approach vs the one laid out by > Alexey, but thought I'd bring it up since in general it seems like external > mini cluster tests are meant to simulate as real of an environment as > possible (modulo a lot of effort). Continuing to use the builtin client > seems > like a step in that direction since we hope it will be the default one day. > > On Mon, Oct 7, 2019 at 5:49 PM Adar Lieber-Dembo <[email protected] > > > wrote: > > > Thanks for the detailed summary. > > > > I agree with the goal of using the same time source across all tests > > if possible. Originally, I was hoping this would be 'builtin', but > > that'd require deploying MiniChronyd too, both to avoid a dependency > > on Internet connectivity and to ensure that clock synchronization at > > test startup is sufficiently fast. And, like you said, not only is > > that a good chunk of work, it's also somewhat antithetical to those > > pure "in-proc" test setups. > > > > So I reluctantly agree that 'system_unsync' may be the next best > > option, with targeted uses of 'system' and 'builtin' for testing > > specific functionality. That said, is it safe that 'system_unsync' > > uses CLOCK_REALTIME under the hood, which is affected by NTP and admin > > time changes? > > > > On Sun, Oct 6, 2019 at 11:27 PM Alexey Serbin > > <[email protected]> wrote: > > > > > > Hi, > > > > > > Recently, built-in NTP client was been introduced in Kudu. Todd Lipcon > > > put together the original WIP patch a couple of years ago [1], and > about > > a > > > week > > > ago the built-in NTP client was merged into the master branch of the > Kudu > > > repo. > > > With that, a new time source is now available for Kudu masters and > tablet > > > servers: the built-in NTP client. > > > > > > With the introduction of the new time source for Kudu components, we > have > > > had a few offline discussions about using different time sources in > Kudu > > > tests. That's not only about providing test coverage for the newly > > > introduced > > > built-in NTP client, but for all other tests as well. Last week Adar > > and I > > > talked about that offline one more time. I'll try to summarize few key > > > points in this e-mail message. > > > > > > With the introduction of the built-in NTP client, a significant part of > > > tests > > > was switched to run with it as the time source. Particularly, all > tests > > > based on ExternalMiniCluster are now run with built-in NTP client with > > > commit > > > 03e2ada69 [2]. The idea is to have more coverage for the newly added > > > functionality, especially given the fact that at some point we might > > switch > > > to use the built-in NTP client by default instead of relying on the > local > > > machine clock synchronized by the system NTP service. > > > > > > There are many other Kudu tests (e.g., tests based on > > InternalMiniCluster) > > > which still require the machine clock to be synchronized with NTP to > run. > > > Ideally, it would be great to run all Kudu tests with using same time > > source > > > unless they are specifically targeted to verify functionality of a > > > particular > > > time source. An example of the latter are tests which cover the > > > functionality > > > of the built-in NTP client itself. > > > > > > Prior to the introduction of the built-in NTP client, almost all Kudu > > tests > > > were running against the local system clock synchronized with NTP, at > > least > > > on Linux. The only exception were tests for read-your-writes and other > > > consistency guarantees: those use mock time source to control the time > > > in a very specific way. > > > > > > In retrospective, it was great to have a single time source because we > > > always > > > knew that the same time source was used for almost all of our tests. > > Also, > > > the time source used in Kudu tests was the same as in Kudu masters and > > > tablet > > > servers running in real distributed clusters. From the other side, > that > > > required local machine's clock to be synchronized with NTP to run those > > > tests, > > > and the tests would fail if the time was not synchronized with NTP. > > > In addition, as far as I know, there are no Kudu tests which are > targeted > > > to detect inconsistencies due to irregularities in the time source or > > > non-synchronized clocks between different Kudu nodes. Of course, we > have > > > Jepsen-based tests, but that's another story: we are not talking about > > > Jepsen > > > tests here, only C++ and Java tests based on gtest and JUnit > frameworks. > > > > > > Now, here comes one observation: all the components of above mentioned > > tests > > > are running at the same machine during execution of corresponding > > scenarios. > > > If all of them are using the local machine's clock, then there is no > need > > > to use NTP or other synchronisation techniques designed to synchronize > > > multiple clocks. > > > > > > So, here comes the question: why to require the local clock to be > > > synchronized > > > with NTP for tests? Yes, we need to verify that the 'system' time > source > > > works for HybridTime timestamps, but as for the functionality of a > > generic > > > Kudu test which is simply based on the HybridTime timestamps without > any > > > knowledge of the underlying time source, why is it important? > > > > > > It seems we can safely change the time source in those generic tests to > > be > > > 'system_unsync': local machine clock which is not necessarily > > synchronized > > > by > > > the kernel NTP discipline (see SystemUnsyncTime in > > > $KUDU_ROOT/src/kudu/clock/system_unsync_time.{h,cc} for details). > > > > > > Once switched to 'system_unsync' time source for tests, in addition it > > will > > > be necessary to add new dedicated tests scenarios such as: > > > * Smoke scenarios: using the system NTP time source for Kudu > clusters. > > > * Smoke scenarios: using the built-in NTP time source for Kudu > > clusters. > > > * Advanced scenarios to detect issues due to irregularities in time > > source > > > (like time jumping back and forth, etc.). > > > > > > What do you think? Your feedback is appreciated. > > > > > > Thank you! > > > > > > > > > > > > Kind regards, > > > > > > Alexey > > > > > > [1] http://gerrit.cloudera.org:8080/7477 > > > [2] > > > > > > https://github.com/apache/kudu/commit/03e2ada694290cafce0bea6ebbf092709aa64a2a > > > > > -- > Andrew Wong >
