Sounds reasonable to me.

On Fri, Oct 25, 2019 at 6:53 PM Alexey Serbin
<[email protected]> wrote:
>
> Thank you Adar and Andrew for the thoughtful feedback.
>
> I think switching to the 'system_unsync' clock for non-EMC (i.e. not based
> on External Mini-Cluster) tests is safe enough, even if it might be
> affected by NTP and manual time changes.
>
> I guess the majority of Kudu tests which are now run with 'system' clock
> source would not even notice if time travels back and/or forth during their
> run.  Those that notice are tests which should be set apart into a separate
> time-sensitive category and run only
>   a. using 'system' time source when machine clock is synchronized with NTP
>   b. using the built-in NTP client when it's synchronized with a reliable
> NTP server
>
> So far I like the idea of switching non-EMC tests to 'system_unsync' time
> source and keeping EMC tests on the 'builtin' source.  Meanwhile, I can
> find out which tests are in the time-sensitive category: it would be an
> environment where system time is jumping back and forth while tests are
> running.
>
> Please let me know what you think about this.
>
>
> Thanks,
>
> Alexey
>
> On Mon, Oct 7, 2019 at 6:46 PM Andrew Wong <[email protected]>
> wrote:
>
> > I had a chat with Alexey about this that I think helped clarify some things
> > for
> > me, so thought I'd share (though feel free to correct me if I'm wrong about
> > anything):
> > - The current state of the master branch is that:
> >   - Our external mini cluster tests initialize MiniChronyd, and because of
> >     this, when used with the builtin client, the clock is considered
> >     synchronized almost immediately.
> >   - All our other tests don't do this. They continue to use the system
> > clock,
> >     and thus, may be susceptible to the "slow NTP" startup issues we've
> > seen
> >     before.
> > - This thread is suggesting that we converge on using a single time piece:
> >   the `system_unsync` aka "local clock", which may not necessarily be
> >   synchronized by the Kernel with time servers, but would be good enough
> > for
> >   all of our existing tests. This seems desirable because it:
> >   - Means that all our tests are run in the same way using the same clock
> >     semantics.
> >   - Should be sufficiently fast to "synchronize" because there's no actual
> > NTP
> >     synchronization involved. And that's OK. Our tests don't care.
> >
> > I agree with this approach, but I also wouldn't be opposed to keeping our
> > existing external mini cluster tests using the builtin client (and using
> > the
> > local clock for other tests), because:
> > - It is closer to what we might expect in the real world because we expect
> > to
> >   eventually default to using the builtin client. While not truly "real",
> > since
> >   we're still synchronizing with something local and not an external set of
> >   time servers, it still seems like better coverage than none.
> > - It should still net us fewer NTP-related "random" test failures.
> > - The line between tests that use external mini clusters and tests that
> > don't
> >   is very clear. And so the argument that it might be confusing to have
> >   different time semantics across tests isn't as compelling to me.
> >
> > I don't feel particularly strongly for this approach vs the one laid out by
> > Alexey, but thought I'd bring it up since in general it seems like external
> > mini cluster tests are meant to simulate as real of an environment as
> > possible (modulo a lot of effort). Continuing to use the builtin client
> > seems
> > like a step in that direction since we hope it will be the default one day.
> >
> > On Mon, Oct 7, 2019 at 5:49 PM Adar Lieber-Dembo <[email protected]
> > >
> > wrote:
> >
> > > Thanks for the detailed summary.
> > >
> > > I agree with the goal of using the same time source across all tests
> > > if possible. Originally, I was hoping this would be 'builtin', but
> > > that'd require deploying MiniChronyd too, both to avoid a dependency
> > > on Internet connectivity and to ensure that clock synchronization at
> > > test startup is sufficiently fast. And, like you said, not only is
> > > that a good chunk of work, it's also somewhat antithetical to those
> > > pure "in-proc" test setups.
> > >
> > > So I reluctantly agree that 'system_unsync' may be the next best
> > > option, with targeted uses of 'system' and 'builtin' for testing
> > > specific functionality. That said, is it safe that 'system_unsync'
> > > uses CLOCK_REALTIME under the hood, which is affected by NTP and admin
> > > time changes?
> > >
> > > On Sun, Oct 6, 2019 at 11:27 PM Alexey Serbin
> > > <[email protected]> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Recently, built-in NTP client was been introduced in Kudu.  Todd Lipcon
> > > > put together the original WIP patch a couple of years ago [1], and
> > about
> > > a
> > > > week
> > > > ago the built-in NTP client was merged into the master branch of the
> > Kudu
> > > > repo.
> > > > With that, a new time source is now available for Kudu masters and
> > tablet
> > > > servers: the built-in NTP client.
> > > >
> > > > With the introduction of the new time source for Kudu components, we
> > have
> > > > had a few offline discussions about using different time sources in
> > Kudu
> > > > tests.  That's not only about providing test coverage for the newly
> > > > introduced
> > > > built-in NTP client, but for all other tests as well.  Last week Adar
> > > and I
> > > > talked about that offline one more time.  I'll try to summarize few key
> > > > points in this e-mail message.
> > > >
> > > > With the introduction of the built-in NTP client, a significant part of
> > > > tests
> > > > was switched to run with it as the time source.  Particularly, all
> > tests
> > > > based on ExternalMiniCluster are now run with built-in NTP client with
> > > > commit
> > > > 03e2ada69 [2].  The idea is to have more coverage for the newly added
> > > > functionality, especially given the fact that at some point we might
> > > switch
> > > > to use the built-in NTP client by default instead of relying on the
> > local
> > > > machine clock synchronized by the system NTP service.
> > > >
> > > > There are many other Kudu tests (e.g., tests based on
> > > InternalMiniCluster)
> > > > which still require the machine clock to be synchronized with NTP to
> > run.
> > > > Ideally, it would be great to run all Kudu tests with using same time
> > > source
> > > > unless they are specifically targeted to verify functionality of a
> > > > particular
> > > > time source.  An example of the latter are tests which cover the
> > > > functionality
> > > > of the built-in NTP client itself.
> > > >
> > > > Prior to the introduction of the built-in NTP client, almost all Kudu
> > > tests
> > > > were running against the local system clock synchronized with NTP, at
> > > least
> > > > on Linux.  The only exception were tests for read-your-writes and other
> > > > consistency guarantees: those use mock time source to control the time
> > > > in a very specific way.
> > > >
> > > > In retrospective, it was great to have a single time source because we
> > > > always
> > > > knew that the same time source was used for almost all of our tests.
> > > Also,
> > > > the time source used in Kudu tests was the same as in Kudu masters and
> > > > tablet
> > > > servers running in real distributed clusters.  From the other side,
> > that
> > > > required local machine's clock to be synchronized with NTP to run those
> > > > tests,
> > > > and the tests would fail if the time was not synchronized with NTP.
> > > > In addition, as far as I know, there are no Kudu tests which are
> > targeted
> > > > to detect inconsistencies due to irregularities in the time source or
> > > > non-synchronized clocks between different Kudu nodes.  Of course, we
> > have
> > > > Jepsen-based tests, but that's another story: we are not talking about
> > > > Jepsen
> > > > tests here, only C++ and Java tests based on gtest and JUnit
> > frameworks.
> > > >
> > > > Now, here comes one observation: all the components of above mentioned
> > > tests
> > > > are running at the same machine during execution of corresponding
> > > scenarios.
> > > > If all of them are using the local machine's clock, then there is no
> > need
> > > > to use NTP or other synchronisation techniques designed to synchronize
> > > > multiple clocks.
> > > >
> > > > So, here comes the question: why to require the local clock to be
> > > > synchronized
> > > > with NTP for tests?  Yes, we need to verify that the 'system' time
> > source
> > > > works for HybridTime timestamps, but as for the functionality of a
> > > generic
> > > > Kudu test which is simply based on the HybridTime timestamps without
> > any
> > > > knowledge of the underlying time source, why is it important?
> > > >
> > > > It seems we can safely change the time source in those generic tests to
> > > be
> > > > 'system_unsync': local machine clock which is not necessarily
> > > synchronized
> > > > by
> > > > the kernel NTP discipline (see SystemUnsyncTime in
> > > > $KUDU_ROOT/src/kudu/clock/system_unsync_time.{h,cc} for details).
> > > >
> > > > Once switched to 'system_unsync' time source for tests, in addition it
> > > will
> > > > be necessary to add new dedicated tests scenarios such as:
> > > >   * Smoke scenarios: using the system NTP time source for Kudu
> > clusters.
> > > >   * Smoke scenarios: using the built-in NTP time source for Kudu
> > > clusters.
> > > >   * Advanced scenarios to detect issues due to irregularities in time
> > > source
> > > >     (like time jumping back and forth, etc.).
> > > >
> > > > What do you think?  Your feedback is appreciated.
> > > >
> > > > Thank you!
> > > >
> > > >
> > > >
> > > > Kind regards,
> > > >
> > > > Alexey
> > > >
> > > > [1] http://gerrit.cloudera.org:8080/7477
> > > > [2]
> > > >
> > >
> > https://github.com/apache/kudu/commit/03e2ada694290cafce0bea6ebbf092709aa64a2a
> > >
> >
> >
> > --
> > Andrew Wong
> >

Reply via email to