[ 
https://issues.apache.org/jira/browse/KUDU-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043896#comment-17043896
 ] 

ASF subversion and git services commented on KUDU-3048:
-------------------------------------------------------

Commit 3adc2882a15af4e777cb8cd0bc7785f002a9402a in kudu's branch 
refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=3adc288 ]

[clock] record maximum error in microseconds

The dedicated NTP servers available to AWS EC2 instances report
sub-millisecond root delay and dispersion (see below), so it makes sense
to switch the true time maximum error histogram to microseconds.

In addition, as a follow-up to the recent series of patches in the
context of KUDU-3048, I added an extra gauge metric to track the
latest maximum time error computed by the built-in NTP client.  I also
renamed 'builtin_ntp_walltime' to 'builtin_ntp_time' for brevity.

This is a follow-up to 7d9d7009 and 6371ffdb5.

[root@aws-centos]# chronyc sources
210 Number of sources = 1
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* 169.254.169.123               3   8   377   106  +1063ns[+1441ns] +/-  466us

[root@aws-centos]# chronyc tracking
Reference ID    : A9FEA97B (169.254.169.123)
Stratum         : 4
Ref time (UTC)  : Sun Feb 23 06:59:58 2020
System time     : 0.000001781 seconds slow of NTP time
Last offset     : +0.000000378 seconds
RMS offset      : 0.000012111 seconds
Frequency       : 30.588 ppm fast
Residual freq   : -0.000 ppm
Skew            : 0.019 ppm
Root delay      : 0.000298246 seconds
Root dispersion : 0.000397126 seconds
Update interval : 260.0 seconds
Leap status     : Normal

[root@aws-centos]# chronyc ntpdata 169.254.169.123
Remote address  : 169.254.169.123 (A9FEA97B)
Remote port     : 123
Local address   : 10.65.14.145 (0A410E91)
Leap status     : Normal
Version         : 4
Mode            : Server
Stratum         : 3
Poll interval   : 8 (256 seconds)
Precision       : -18 (0.000003815 seconds)
Root delay      : 0.000198 seconds
Root dispersion : 0.000305 seconds
Reference ID    : A9FEA97A ()
Reference time  : Sun Feb 23 06:55:06 2020
Offset          : +0.000014476 seconds
Peer delay      : 0.000101295 seconds
Peer dispersion : 0.000003855 seconds
Response time   : 0.000015916 seconds
Jitter asymmetry: +0.00
NTP tests       : 111 111 1111
Interleaved     : No
Authenticated   : No
TX timestamping : Kernel
RX timestamping : Kernel
Total TX        : 712
Total RX        : 712
Total valid RX  : 712

Change-Id: Idaa950fd3dff3e2a4cedbf5ae5b93b49f7b9465c
Reviewed-on: http://gerrit.cloudera.org:8080/15275
Reviewed-by: Adar Dembo <[email protected]>
Tested-by: Kudu Jenkins
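
To put numbers on the "sub-millisecond" claim, here is a minimal sketch that
turns the `chronyc tracking` values above into an error bound, using the
relation given in the chrony documentation: clock error <= |system time
offset| + root dispersion + root delay / 2. The function name is hypothetical,
and whether Kudu's built-in NTP client computes its maximum error in exactly
this way is an assumption, not something the commit message states.

```python
def max_clock_error_us(root_delay_s, root_dispersion_s, offset_s=0.0):
    """Upper bound on the clock error, in microseconds.

    Applies the bound from the chrony documentation:
        |offset| + root dispersion + root delay / 2
    All inputs are in seconds. The function name is hypothetical.
    """
    bound_s = abs(offset_s) + root_dispersion_s + root_delay_s / 2.0
    return bound_s * 1_000_000


# Values copied from the `chronyc tracking` output above.
error_us = max_clock_error_us(
    root_delay_s=0.000298246,
    root_dispersion_s=0.000397126,
    offset_s=0.000001781,
)

print(f"max error: {error_us:.1f} us")
# The same bound expressed in whole milliseconds.
print(f"whole milliseconds: {int(error_us // 1000)}")
```

With these inputs the bound is roughly 548 microseconds, which is 0 when
expressed in whole milliseconds. That loss of resolution is the motivation for
recording the histogram in microseconds.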


> Add time/clock synchronization metrics
> --------------------------------------
>
>                 Key: KUDU-3048
>                 URL: https://issues.apache.org/jira/browse/KUDU-3048
>             Project: Kudu
>          Issue Type: Improvement
>          Components: clock, master, tserver
>            Reporter: Alexey Serbin
>            Assignee: Alexey Serbin
>            Priority: Major
>              Labels: clock
>             Fix For: 1.12.0
>
>
> For better visibility, it would be great to add metrics reflecting time/clock 
> synchronization parameters:
> * stats on the max_error sampled while reading the underlying clock
> * stats on the time intervals during which the underlying clock was 
> extrapolated instead of using the actual readings: the number of such 
> intervals and stats on their duration
> * whether hybrid clock timestamps are being generated from extrapolated clock 
> readings instead of real ones
> * if using the {{built-in}} time source:
> ** the difference between the tracked true time and the local wallclock
> ** the most recently computed true time
> ** stats on the maximum error of the computed true time
>
> As for the rationale behind the new metrics:
> * max_error bounds how far the clock can be from the true time; if it grows, 
> it may be time to use another set of NTP servers or to increase the 
> {{--max_clock_sync_error_usec}} flag value
> * extrapolation intervals for the hybrid clock indicate periods when the NTP 
> servers were unavailable; a possible action is to revisit the set of NTP 
> servers
> * if hybrid timestamps are extrapolated for long enough, Kudu masters and 
> tablet servers might crash once the clock error goes beyond the configured 
> threshold: it's time to start troubleshooting the issue to avoid possible 
> unavailability of the cluster
> * the delta between the true time tracked by the built-in NTP client and the 
> local system clock helps to understand how log timestamps relate to 
> HybridClock timestamps (when using the built-in NTP client, the two might 
> diverge)
> * the stats on the true time computed by the built-in NTP client give insight 
> into the quality of the reference NTP servers
>
> The new metrics can be used for monitoring and alerting, allowing for 
> proactive maintenance of a Kudu cluster.
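
As an illustration of the alerting use case described above, the following is
a minimal sketch of a check that compares a sampled maximum clock error
against the configured threshold. The function name, the severity labels, and
the 50% warning fraction are all hypothetical; the threshold argument stands
for whatever value `--max_clock_sync_error_usec` is set to in a given cluster,
and the sample values below are illustrative only.

```python
def classify_clock_error(max_error_us, threshold_us, warn_fraction=0.5):
    """Classify a sampled maximum clock error against a threshold.

    max_error_us  -- sampled maximum clock error, in microseconds
    threshold_us  -- configured limit (the value of the
                     --max_clock_sync_error_usec flag), in microseconds
    warn_fraction -- hypothetical fraction of the threshold at which to warn
    """
    if threshold_us <= 0:
        raise ValueError("threshold_us must be positive")
    if max_error_us >= threshold_us:
        return "critical"
    if max_error_us >= warn_fraction * threshold_us:
        return "warning"
    return "ok"


# Illustrative values only: a 10-second threshold and three sampled errors.
threshold_us = 10_000_000
for sample_us in (548, 6_000_000, 10_000_000):
    print(sample_us, classify_clock_error(sample_us, threshold_us))
```

Warning well below the threshold leaves time to investigate the NTP sources
before the error reaches the limit at which, per the description above, masters
and tablet servers might crash.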



