Sounds reasonable to me.
On Fri, Oct 25, 2019 at 6:53 PM Alexey Serbin <[email protected]> wrote: > > Thank you Adar and Andrew for the thoughtful feedback. > > I think switching to the 'system_unsync' clock for non-EMC (i.e. not based > on External Mini-Cluster) tests is safe enough, even if it might be > affected by NTP and manual time changes. > > I guess the majority of Kudu tests which are now run with 'system' clock > source would not even notice if time travels back and/or forth during their > run. Those that notice are tests which should be set apart into a separate > time-sensitive category and run only > a. using 'system' time source when machine clock is synchronized with NTP > b. using the built-in NTP client when it's synchronized with a reliable > NTP server > > So far I like the idea of switching non-EMC tests to 'system_unsync' time > source and keeping EMC tests on the 'builtin' source. Meanwhile, I can > find out which tests are in the time-sensitive category: it would be an > environment where system time is jumping back and forth while tests are > running. > > Please let me know what you think about this. > > > Thanks, > > Alexey > > On Mon, Oct 7, 2019 at 6:46 PM Andrew Wong <[email protected]> > wrote: > > > I had a chat with Alexey about this that I think helped clarify some things > > for > > me, so thought I'd share (though feel free to correct me if I'm wrong about > > anything): > > - The current state of the master branch is that: > > - Our external mini cluster tests initialize MiniChronyd, and because of > > this, when used with the builtin client, the clock is considered > > synchronized almost immediately. > > - All our other tests don't do this. They continue to use the system > > clock, > > and thus, may be susceptible to the "slow NTP" startup issues we've > > seen > > before. > > - This thread is suggesting that we converge on using a single time piece: > > the `system_unsync` aka "local clock", which may not necessarily be > > synchronized by the Kernel with time servers, but would be good enough > > for > > all of our existing tests. This seems desirable because it: > > - Means that all our tests are run in the same way using the same clock > > semantics. > > - Should be sufficiently fast to "synchronize" because there's no actual > > NTP > > synchronization involved. And that's OK. Our tests don't care. > > > > I agree with this approach, but I also wouldn't be opposed to keeping our > > existing external mini cluster tests using the builtin client (and using > > the > > local clock for other tests), because: > > - It is closer to what we might expect in the real world because we expect > > to > > eventually default to using the builtin client. While not truly "real", > > since > > we're still synchronizing with something local and not an external set of > > time servers, it still seems like better coverage than none. > > - It should still net us fewer NTP-related "random" test failures. > > - The line between tests that use external mini clusters and tests that > > don't > > is very clear. And so the argument that it might be confusing to have > > different time semantics across tests isn't as compelling to me. > > > > I don't feel particularly strongly for this approach vs the one laid out by > > Alexey, but thought I'd bring it up since in general it seems like external > > mini cluster tests are meant to simulate as real of an environment as > > possible (modulo a lot of effort). Continuing to use the builtin client > > seems > > like a step in that direction since we hope it will be the default one day. > > > > On Mon, Oct 7, 2019 at 5:49 PM Adar Lieber-Dembo <[email protected] > > > > > wrote: > > > > > Thanks for the detailed summary. > > > > > > I agree with the goal of using the same time source across all tests > > > if possible. Originally, I was hoping this would be 'builtin', but > > > that'd require deploying MiniChronyd too, both to avoid a dependency > > > on Internet connectivity and to ensure that clock synchronization at > > > test startup is sufficiently fast. And, like you said, not only is > > > that a good chunk of work, it's also somewhat antithetical to those > > > pure "in-proc" test setups. > > > > > > So I reluctantly agree that 'system_unsync' may be the next best > > > option, with targeted uses of 'system' and 'builtin' for testing > > > specific functionality. That said, is it safe that 'system_unsync' > > > uses CLOCK_REALTIME under the hood, which is affected by NTP and admin > > > time changes? > > > > > > On Sun, Oct 6, 2019 at 11:27 PM Alexey Serbin > > > <[email protected]> wrote: > > > > > > > > Hi, > > > > > > > > Recently, built-in NTP client was been introduced in Kudu. Todd Lipcon > > > > put together the original WIP patch a couple of years ago [1], and > > about > > > a > > > > week > > > > ago the built-in NTP client was merged into the master branch of the > > Kudu > > > > repo. > > > > With that, a new time source is now available for Kudu masters and > > tablet > > > > servers: the built-in NTP client. > > > > > > > > With the introduction of the new time source for Kudu components, we > > have > > > > had a few offline discussions about using different time sources in > > Kudu > > > > tests. That's not only about providing test coverage for the newly > > > > introduced > > > > built-in NTP client, but for all other tests as well. Last week Adar > > > and I > > > > talked about that offline one more time. I'll try to summarize few key > > > > points in this e-mail message. > > > > > > > > With the introduction of the built-in NTP client, a significant part of > > > > tests > > > > was switched to run with it as the time source. Particularly, all > > tests > > > > based on ExternalMiniCluster are now run with built-in NTP client with > > > > commit > > > > 03e2ada69 [2]. The idea is to have more coverage for the newly added > > > > functionality, especially given the fact that at some point we might > > > switch > > > > to use the built-in NTP client by default instead of relying on the > > local > > > > machine clock synchronized by the system NTP service. > > > > > > > > There are many other Kudu tests (e.g., tests based on > > > InternalMiniCluster) > > > > which still require the machine clock to be synchronized with NTP to > > run. > > > > Ideally, it would be great to run all Kudu tests with using same time > > > source > > > > unless they are specifically targeted to verify functionality of a > > > > particular > > > > time source. An example of the latter are tests which cover the > > > > functionality > > > > of the built-in NTP client itself. > > > > > > > > Prior to the introduction of the built-in NTP client, almost all Kudu > > > tests > > > > were running against the local system clock synchronized with NTP, at > > > least > > > > on Linux. The only exception were tests for read-your-writes and other > > > > consistency guarantees: those use mock time source to control the time > > > > in a very specific way. > > > > > > > > In retrospective, it was great to have a single time source because we > > > > always > > > > knew that the same time source was used for almost all of our tests. > > > Also, > > > > the time source used in Kudu tests was the same as in Kudu masters and > > > > tablet > > > > servers running in real distributed clusters. From the other side, > > that > > > > required local machine's clock to be synchronized with NTP to run those > > > > tests, > > > > and the tests would fail if the time was not synchronized with NTP. > > > > In addition, as far as I know, there are no Kudu tests which are > > targeted > > > > to detect inconsistencies due to irregularities in the time source or > > > > non-synchronized clocks between different Kudu nodes. Of course, we > > have > > > > Jepsen-based tests, but that's another story: we are not talking about > > > > Jepsen > > > > tests here, only C++ and Java tests based on gtest and JUnit > > frameworks. > > > > > > > > Now, here comes one observation: all the components of above mentioned > > > tests > > > > are running at the same machine during execution of corresponding > > > scenarios. > > > > If all of them are using the local machine's clock, then there is no > > need > > > > to use NTP or other synchronisation techniques designed to synchronize > > > > multiple clocks. > > > > > > > > So, here comes the question: why to require the local clock to be > > > > synchronized > > > > with NTP for tests? Yes, we need to verify that the 'system' time > > source > > > > works for HybridTime timestamps, but as for the functionality of a > > > generic > > > > Kudu test which is simply based on the HybridTime timestamps without > > any > > > > knowledge of the underlying time source, why is it important? > > > > > > > > It seems we can safely change the time source in those generic tests to > > > be > > > > 'system_unsync': local machine clock which is not necessarily > > > synchronized > > > > by > > > > the kernel NTP discipline (see SystemUnsyncTime in > > > > $KUDU_ROOT/src/kudu/clock/system_unsync_time.{h,cc} for details). > > > > > > > > Once switched to 'system_unsync' time source for tests, in addition it > > > will > > > > be necessary to add new dedicated tests scenarios such as: > > > > * Smoke scenarios: using the system NTP time source for Kudu > > clusters. > > > > * Smoke scenarios: using the built-in NTP time source for Kudu > > > clusters. > > > > * Advanced scenarios to detect issues due to irregularities in time > > > source > > > > (like time jumping back and forth, etc.). > > > > > > > > What do you think? Your feedback is appreciated. > > > > > > > > Thank you! > > > > > > > > > > > > > > > > Kind regards, > > > > > > > > Alexey > > > > > > > > [1] http://gerrit.cloudera.org:8080/7477 > > > > [2] > > > > > > > > > https://github.com/apache/kudu/commit/03e2ada694290cafce0bea6ebbf092709aa64a2a > > > > > > > > > -- > > Andrew Wong > >
