I had a chat with Alexey about this that I think helped clarify some things
for me, so thought I'd share (though feel free to correct me if I'm wrong
about anything):
- The current state of the master branch is that:
  - Our external mini cluster tests initialize MiniChronyd, and because of
    this, when used with the builtin client, the clock is considered
    synchronized almost immediately.
  - All our other tests don't do this. They continue to use the system
    clock, and thus may be susceptible to the "slow NTP" startup issues
    we've seen before.
- This thread is suggesting that we converge on using a single time piece:
  the `system_unsync` aka "local clock", which may not necessarily be
  synchronized by the kernel with time servers, but would be good enough for
  all of our existing tests (rough sketch below). This seems desirable
  because it:
  - Means that all our tests are run in the same way using the same clock
    semantics.
  - Should be sufficiently fast to "synchronize" because there's no actual
    NTP synchronization involved. And that's OK. Our tests don't care.
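
To make that concrete, here's a rough sketch of what a fixture could look
like. This is illustrative only: the flag name, header path, namespace, and
option fields below are my assumptions from memory, not something I verified
against master.

    // Illustrative sketch only; the flag name, header path, namespace, and
    // option fields are assumptions, not checked against the current tree.
    #include <string>

    #include "kudu/mini-cluster/external_mini_cluster.h"
    #include "kudu/util/status.h"

    using kudu::cluster::ExternalMiniCluster;
    using kudu::cluster::ExternalMiniClusterOptions;

    // Starts an external mini cluster whose daemons all use the given time
    // source, e.g. "builtin" (in tests, pointed at MiniChronyd and
    // synchronized almost immediately) or "system_unsync" (local clock, no
    // NTP involved at all).
    void StartClusterWithTimeSource(const std::string& time_source) {
      ExternalMiniClusterOptions opts;
      opts.num_tablet_servers = 3;
      opts.extra_master_flags.emplace_back("--time_source=" + time_source);
      opts.extra_tserver_flags.emplace_back("--time_source=" + time_source);
      ExternalMiniCluster cluster(opts);
      CHECK_OK(cluster.Start());
    }
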
I agree with this approach, but I also wouldn't be opposed to keeping our
existing external mini cluster tests using the builtin client (and using the
local clock for other tests), because:
- It is closer to what we might expect in the real world because we expect
  to eventually default to using the builtin client. While not truly "real",
  since we're still synchronizing with something local and not an external
  set of time servers, it still seems like better coverage than none.
- It should still net us fewer NTP-related "random" test failures.
- The line between tests that use external mini clusters and tests that
  don't is very clear. And so the argument that it might be confusing to
  have different time semantics across tests isn't as compelling to me.
I don't feel particularly strongly for this approach vs the one laid out by
Alexey, but thought I'd bring it up since in general it seems like external
mini cluster tests are meant to simulate as real an environment as possible
(modulo a lot of effort). Continuing to use the builtin client seems like a
step in that direction since we hope it will be the default one day.
On Mon, Oct 7, 2019 at 5:49 PM Adar Lieber-Dembo <[email protected]> wrote:
> Thanks for the detailed summary.
>
> I agree with the goal of using the same time source across all tests
> if possible. Originally, I was hoping this would be 'builtin', but
> that'd require deploying MiniChronyd too, both to avoid a dependency
> on Internet connectivity and to ensure that clock synchronization at
> test startup is sufficiently fast. And, like you said, not only is
> that a good chunk of work, it's also somewhat antithetical to those
> pure "in-proc" test setups.
>
> So I reluctantly agree that 'system_unsync' may be the next best
> option, with targeted uses of 'system' and 'builtin' for testing
> specific functionality. That said, is it safe for 'system_unsync' to
> use CLOCK_REALTIME under the hood, given that CLOCK_REALTIME is
> affected by NTP and administrative time changes?
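>
> (To make that concern concrete: the snippet below is a minimal,
> Kudu-independent illustration of the two POSIX clocks involved. It's a
> sketch for discussion, not anything from the Kudu tree.)
>
>     #include <stdio.h>
>     #include <time.h>
>
>     int main() {
>       struct timespec realtime;
>       struct timespec monotonic;
>       // Wall-clock time: subject to NTP slewing/stepping and to manual
>       // changes by an administrator, so successive readings can go
>       // backwards.
>       clock_gettime(CLOCK_REALTIME, &realtime);
>       // Monotonic time: never goes backwards, but has no relation to
>       // calendar time, so it can't back a wall-clock reading by itself.
>       clock_gettime(CLOCK_MONOTONIC, &monotonic);
>       printf("realtime:  %lld.%09ld\n",
>              (long long)realtime.tv_sec, realtime.tv_nsec);
>       printf("monotonic: %lld.%09ld\n",
>              (long long)monotonic.tv_sec, monotonic.tv_nsec);
>       return 0;
>     }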
>
> On Sun, Oct 6, 2019 at 11:27 PM Alexey Serbin <[email protected]> wrote:
> >
> > Hi,
> >
> > Recently, a built-in NTP client was introduced in Kudu. Todd Lipcon
> > put together the original WIP patch a couple of years ago [1], and about
> > a week ago the built-in NTP client was merged into the master branch of
> > the Kudu repo. With that, a new time source is now available for Kudu
> > masters and tablet servers: the built-in NTP client.
> >
> > With the introduction of the new time source for Kudu components, we
> > have had a few offline discussions about using different time sources in
> > Kudu tests. That's not only about providing test coverage for the newly
> > introduced built-in NTP client, but about the time source used by all
> > other tests as well. Last week Adar and I talked about that offline one
> > more time. I'll try to summarize a few key points in this e-mail message.
> >
> > With the introduction of the built-in NTP client, a significant part of
> > the tests was switched to run with it as the time source. In particular,
> > all tests based on ExternalMiniCluster now run with the built-in NTP
> > client as of commit 03e2ada69 [2]. The idea is to have more coverage for
> > the newly added functionality, especially given the fact that at some
> > point we might switch to using the built-in NTP client by default
> > instead of relying on the local machine clock synchronized by the system
> > NTP service.
> >
> > There are many other Kudu tests (e.g., tests based on
> > InternalMiniCluster) which still require the machine clock to be
> > synchronized with NTP to run. Ideally, it would be great to run all Kudu
> > tests using the same time source unless they specifically target the
> > functionality of a particular time source. An example of the latter is
> > the set of tests which cover the functionality of the built-in NTP
> > client itself.
> >
> > Prior to the introduction of the built-in NTP client, almost all Kudu
> > tests were running against the local system clock synchronized with NTP,
> > at least on Linux. The only exceptions were tests for read-your-writes
> > and other consistency guarantees: those use a mock time source to
> > control the time in a very specific way.
> >
> > In retrospect, it was great to have a single time source because we
> > always knew that the same time source was used for almost all of our
> > tests. Also, the time source used in Kudu tests was the same as in Kudu
> > masters and tablet servers running in real distributed clusters. On the
> > other hand, that required the local machine's clock to be synchronized
> > with NTP to run those tests, and the tests would fail if the time was
> > not synchronized with NTP. In addition, as far as I know, there are no
> > Kudu tests which are targeted at detecting inconsistencies due to
> > irregularities in the time source or non-synchronized clocks between
> > different Kudu nodes. Of course, we have Jepsen-based tests, but that's
> > another story: we are not talking about Jepsen tests here, only C++ and
> > Java tests based on the gtest and JUnit frameworks.
> >
> > Now, here comes one observation: all the components of the
> > above-mentioned tests run on the same machine during the execution of
> > the corresponding scenarios. If all of them use the local machine's
> > clock, then there is no need for NTP or other synchronization techniques
> > designed to keep multiple clocks in agreement.
> >
> > So, here comes the question: why require the local clock to be
> > synchronized with NTP for tests? Yes, we need to verify that the
> > 'system' time source works for HybridTime timestamps, but for a generic
> > Kudu test which simply relies on HybridTime timestamps without any
> > knowledge of the underlying time source, why is synchronization
> > important?
> >
> > It seems we can safely change the time source in those generic tests to
> > 'system_unsync': the local machine clock, which is not necessarily
> > synchronized by the kernel NTP discipline (see SystemUnsyncTime in
> > $KUDU_ROOT/src/kudu/clock/system_unsync_time.{h,cc} for details; a rough
> > sketch of the idea is below).
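> >
> > (Illustrative sketch only -- the real interface lives in the files above
> > and may differ in names and signatures. The point is simply: read the
> > local wall clock directly and report a fixed, conservative error bound,
> > without asking the kernel whether NTP considers the clock synchronized.)
> >
> >     #include <sys/time.h>
> >     #include <cstdint>
> >
> >     // Hypothetical stand-in for the real time-source hook; the constant
> >     // error bound below is made up for illustration.
> >     void WalltimeWithError(int64_t* now_usec, int64_t* error_usec) {
> >       timeval tv;
> >       gettimeofday(&tv, nullptr);
> >       *now_usec = static_cast<int64_t>(tv.tv_sec) * 1000000 + tv.tv_usec;
> >       // No kernel NTP status to consult, so assume a constant maximum
> >       // error instead of a dynamically tracked one.
> >       *error_usec = 1000000;  // one second
> >     }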
> >
> > Once tests are switched to the 'system_unsync' time source, it will also
> > be necessary to add new dedicated test scenarios such as the following
> > (a rough sketch of how the smoke scenarios could be parameterized comes
> > after the list):
> >   * Smoke scenarios: using the system NTP time source for Kudu clusters.
> >   * Smoke scenarios: using the built-in NTP time source for Kudu
> >     clusters.
> >   * Advanced scenarios to detect issues due to irregularities in the
> >     time source (like time jumping back and forth, etc.).
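> >
> > (Rough gtest sketch; the test name and flag value spellings are
> > illustrative, not actual Kudu code.)
> >
> >     #include <string>
> >
> >     #include <gtest/gtest.h>
> >
> >     class TimeSourceSmokeTest
> >         : public ::testing::TestWithParam<std::string> {};
> >
> >     TEST_P(TimeSourceSmokeTest, WriteThenRead) {
> >       const std::string& time_source = GetParam();
> >       // Start a cluster with "--time_source=" + time_source, write a few
> >       // rows, and read them back; only the time source varies per run.
> >       SUCCEED() << "smoke run with time_source=" << time_source;
> >     }
> >
> >     INSTANTIATE_TEST_SUITE_P(TimeSources, TimeSourceSmokeTest,
> >                              ::testing::Values("system", "builtin"));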
> >
> > What do you think? Your feedback is appreciated.
> >
> > Thank you!
> >
> >
> >
> > Kind regards,
> >
> > Alexey
> >
> > [1] http://gerrit.cloudera.org:8080/7477
> > [2] https://github.com/apache/kudu/commit/03e2ada694290cafce0bea6ebbf092709aa64a2a
>
--
Andrew Wong