[
https://issues.apache.org/jira/browse/KUDU-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexey Serbin updated KUDU-3048:
--------------------------------
Description:
For better visibility, it would be great to add metrics reflecting time/clock
synchronization parameters:
* the stats on the max_error sampled while reading the underlying clock
* the stats on time intervals when the underlying clock was extrapolated
instead of using the actual readings: number of such intervals and stats on the
interval duration
* whether hybrid clock timestamps are generated using extrapolated clock
readings instead of real ones
* if using the {{built-in}} time source:
** difference between tracked true time and local wallclock
** most recently computed true time
** the stats on the maximum error of the computed true time
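To illustrate the extrapolation-related items above, the bookkeeping could look roughly like the Python sketch below. This is only an illustration, not the proposed implementation: the {{ExtrapolationTracker}} class, its method names, and the shape of the returned stats are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExtrapolationTracker:
    """Hypothetical helper: tracks intervals of extrapolated clock readings."""
    # Durations (in seconds) of completed extrapolation intervals.
    durations: List[float] = field(default_factory=list)
    # Start time of the ongoing extrapolation interval, if any.
    _started_at: Optional[float] = None

    @property
    def is_extrapolating(self) -> bool:
        """True while readings are extrapolated rather than real."""
        return self._started_at is not None

    def record(self, now: float, reading_ok: bool) -> None:
        """Record the outcome of one attempt to read the underlying clock."""
        if not reading_ok and self._started_at is None:
            # The first failed reading opens a new extrapolation interval.
            self._started_at = now
        elif reading_ok and self._started_at is not None:
            # The first successful reading closes the ongoing interval.
            self.durations.append(now - self._started_at)
            self._started_at = None

    def stats(self) -> dict:
        """Number of completed intervals plus max and mean duration."""
        n = len(self.durations)
        return {
            "count": n,
            "max": max(self.durations, default=0.0),
            "mean": sum(self.durations) / n if n else 0.0,
        }
```

Feeding it one {{record()}} call per clock reading gives both the "is extrapolating now" flag and the count/duration stats on past intervals.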
As for the rationale behind the new metrics:
* max_error shows how far the clock might be from the true time; if it grows
too large, it might be time to use another set of NTP servers or to increase
the {{\-\-max_clock_sync_error_usec}} flag value
* the presence of extrapolation intervals for the hybrid clock signals periods
when the NTP servers were unavailable; a possible action would be revisiting
the set of NTP servers
* if hybrid timestamps are being extrapolated for some time, Kudu masters and
tablet servers might crash if the clock error eventually goes beyond the
configured threshold: it's time to start troubleshooting the issue to avoid
possible unavailability of the cluster
* the delta between true time tracked by the built-in NTP client and the local
system clock is useful to understand how the log timestamps are related to the
HybridClock timestamps (in case of using the built-in NTP client those might
diverge)
* the stats on true time computed by the built-in NTP client give insight into
the quality of the reference NTP servers
The new metrics can be used for monitoring and alerting, allowing for
proactive maintenance of a Kudu cluster.
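As a rough illustration of such alerting, a monitoring script could compare the sampled values against the configured threshold, along the lines of the Python sketch below. The metric names {{clock_max_error_usec}} and {{clock_extrapolating}}, the {{check_clock_metrics}} function, and the {{warn_fraction}} parameter are assumptions made for this example; the actual metric names are not defined in this description.

```python
from typing import Dict, List


def check_clock_metrics(metrics: Dict[str, float],
                        max_clock_sync_error_usec: float,
                        warn_fraction: float = 0.5) -> List[str]:
    """Return alert messages for a snapshot of (hypothetical) clock metrics.

    `metrics` maps metric names to their latest values. The names used
    here are placeholders, not real Kudu metric names.
    """
    alerts = []

    # Warn well before the error reaches the configured threshold, since
    # going beyond the threshold might crash masters and tablet servers.
    max_error = metrics.get("clock_max_error_usec")
    if (max_error is not None
            and max_error >= warn_fraction * max_clock_sync_error_usec):
        alerts.append(
            "max_error is %d usec, threshold is %d usec"
            % (max_error, max_clock_sync_error_usec))

    # Extrapolated timestamps hint that the NTP servers are unavailable.
    if metrics.get("clock_extrapolating", 0):
        alerts.append("hybrid clock timestamps are being extrapolated")

    return alerts


# Example with made-up values; 10000000 stands in for whatever
# --max_clock_sync_error_usec is set to on the server being checked.
sample = {"clock_max_error_usec": 7500000, "clock_extrapolating": 1}
for message in check_clock_metrics(sample, 10000000):
    print(message)
```

Warning at a fraction of the threshold, rather than at the threshold itself, leaves time to troubleshoot before the servers are affected.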
was:
For better visibility, it would be great to add metrics reflecting time/clock
synchronization parameters:
* the stats on the max_error sampled while reading the underlying clock
* the stats on time intervals when the underlying clock was extrapolated
instead of using the actual readings: number of such intervals and stats on the
interval duration
* whether hybrid clock timestamps are generated using extrapolated clock
readings instead of real ones
* if using the {{built-in}} time source:
** difference between tracked true time and local wallclock
** most recently computed true time
** the stats on the maximum error of the computed true time
As for the rationale behind the new metrics:
* max_error shows how far the clock might be from the true time; if it grows
too large, it might be time to use another set of NTP servers or to increase
the {{\-\-max_clock_sync_error_usec}} flag value
* the presence of extrapolation intervals for the hybrid clock signals periods
when the NTP servers were unavailable; a possible action would be revisiting
the set of NTP servers
* if hybrid timestamps are being extrapolated for some time, Kudu masters and
tablet servers might crash if the clock error eventually goes beyond the
configured threshold: it's time to start troubleshooting the issue to avoid
possible unavailability of the cluster
The new metrics can be used for monitoring and alerting, allowing for
proactive maintenance of a Kudu cluster.
> Add time/clock synchronization metrics
> --------------------------------------
>
> Key: KUDU-3048
> URL: https://issues.apache.org/jira/browse/KUDU-3048
> Project: Kudu
> Issue Type: Improvement
> Components: clock, master, tserver
> Reporter: Alexey Serbin
> Assignee: Alexey Serbin
> Priority: Major
> Labels: clock