Hi,
Recently, built-in NTP client was been introduced in Kudu. Todd Lipcon
put together the original WIP patch a couple of years ago [1], and about a
week
ago the built-in NTP client was merged into the master branch of the Kudu
repo.
With that, a new time source is now available for Kudu masters and tablet
servers: the built-in NTP client.
With the introduction of the new time source for Kudu components, we have
had a few offline discussions about using different time sources in Kudu
tests. That's not only about providing test coverage for the newly
introduced
built-in NTP client, but for all other tests as well. Last week Adar and I
talked about that offline one more time. I'll try to summarize few key
points in this e-mail message.
With the introduction of the built-in NTP client, a significant part of
tests
was switched to run with it as the time source. Particularly, all tests
based on ExternalMiniCluster are now run with built-in NTP client with
commit
03e2ada69 [2]. The idea is to have more coverage for the newly added
functionality, especially given the fact that at some point we might switch
to use the built-in NTP client by default instead of relying on the local
machine clock synchronized by the system NTP service.
There are many other Kudu tests (e.g., tests based on InternalMiniCluster)
which still require the machine clock to be synchronized with NTP to run.
Ideally, it would be great to run all Kudu tests with using same time source
unless they are specifically targeted to verify functionality of a
particular
time source. An example of the latter are tests which cover the
functionality
of the built-in NTP client itself.
Prior to the introduction of the built-in NTP client, almost all Kudu tests
were running against the local system clock synchronized with NTP, at least
on Linux. The only exception were tests for read-your-writes and other
consistency guarantees: those use mock time source to control the time
in a very specific way.
In retrospective, it was great to have a single time source because we
always
knew that the same time source was used for almost all of our tests. Also,
the time source used in Kudu tests was the same as in Kudu masters and
tablet
servers running in real distributed clusters. From the other side, that
required local machine's clock to be synchronized with NTP to run those
tests,
and the tests would fail if the time was not synchronized with NTP.
In addition, as far as I know, there are no Kudu tests which are targeted
to detect inconsistencies due to irregularities in the time source or
non-synchronized clocks between different Kudu nodes. Of course, we have
Jepsen-based tests, but that's another story: we are not talking about
Jepsen
tests here, only C++ and Java tests based on gtest and JUnit frameworks.
Now, here comes one observation: all the components of above mentioned tests
are running at the same machine during execution of corresponding scenarios.
If all of them are using the local machine's clock, then there is no need
to use NTP or other synchronisation techniques designed to synchronize
multiple clocks.
So, here comes the question: why to require the local clock to be
synchronized
with NTP for tests? Yes, we need to verify that the 'system' time source
works for HybridTime timestamps, but as for the functionality of a generic
Kudu test which is simply based on the HybridTime timestamps without any
knowledge of the underlying time source, why is it important?
It seems we can safely change the time source in those generic tests to be
'system_unsync': local machine clock which is not necessarily synchronized
by
the kernel NTP discipline (see SystemUnsyncTime in
$KUDU_ROOT/src/kudu/clock/system_unsync_time.{h,cc} for details).
Once switched to 'system_unsync' time source for tests, in addition it will
be necessary to add new dedicated tests scenarios such as:
* Smoke scenarios: using the system NTP time source for Kudu clusters.
* Smoke scenarios: using the built-in NTP time source for Kudu clusters.
* Advanced scenarios to detect issues due to irregularities in time source
(like time jumping back and forth, etc.).
What do you think? Your feedback is appreciated.
Thank you!
Kind regards,
Alexey
[1] http://gerrit.cloudera.org:8080/7477
[2]
https://github.com/apache/kudu/commit/03e2ada694290cafce0bea6ebbf092709aa64a2a