[ 
https://issues.apache.org/jira/browse/KUDU-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17037305#comment-17037305
 ] 

ASF subversion and git services commented on KUDU-3048:
-------------------------------------------------------

Commit 6371ffdb54fa09f891e01a120e74993f38286c9f in kudu's branch 
refs/heads/master from Alexey Serbin
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=6371ffd ]

[hybrid clock] KUDU-3048 introduce new clock metrics

Introduced additional metrics for the hybrid clock:
  * whether hybrid clock is using extrapolated readings for the
    underlying clock instead of actual readings
  * histogram for the duration of intervals when the underlying clock
    was extrapolated
  * histogram for the maximum errors reported by the underlying clock

I ran a small Kudu cluster to manually verify the behavior of the
newly introduced metrics: I'm not sure it's worth adding automated
tests for this given the already existing 'hybrid_clock_error' metric
didn't have any test coverage.

Change-Id: I8575ba7d8baed78b13351e8cebf1a74f44b31b82
Reviewed-on: http://gerrit.cloudera.org:8080/15212
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin <[email protected]>


> Add time/clock synchronization metrics
> --------------------------------------
>
>                 Key: KUDU-3048
>                 URL: https://issues.apache.org/jira/browse/KUDU-3048
>             Project: Kudu
>          Issue Type: Improvement
>          Components: clock, master, tserver
>            Reporter: Alexey Serbin
>            Assignee: Alexey Serbin
>            Priority: Major
>              Labels: clock
>
> For better visibility, it would be great to add metrics reflecting time/clock 
> synchronization parameters:
> * the stats on the max_error sampled while reading the underlying clock
> * the stats on time intervals when the underlying clock was extrapolated 
> instead of using the actual readings: number of such intervals and stats on 
> the interval duration
> * whether hybrid clock timestamps are generated using extrapolated clock 
> readings instead of real ones
> * if using the {{built-in}} time source:
> ** the number of servers used for the true time tracking (good references)
> ** the number of servers not used for the true time tracking (bad references)
> As for the rationale behind the new metrics:
> * max_error shows how far the clock might be from the true time; if it is 
> large, it may be time to use another set of NTP servers or to increase the 
> {{--max_clock_sync_error_usec}} flag value
> * the presence of extrapolation intervals for the hybrid clock signals 
> periods when the NTP servers were unavailable; a possible action is to 
> revisit the set of NTP servers
> * if hybrid timestamps are being extrapolated for some time, Kudu masters and 
> tablet servers might crash if the clock error eventually goes beyond the 
> configured threshold: it's time to start troubleshooting the issue to avoid 
> possible unavailability of the cluster
> The new metrics can be used for monitoring and alerting, allowing for 
> proactive maintenance of a Kudu cluster.
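
As an illustration of the monitoring use case described above, here is a
minimal sketch of an alerting check over a parsed JSON metrics dump. The
metric names (hybrid_clock_extrapolating, hybrid_clock_max_errors), the
entity layout ("type", "id", "metrics"), and the histogram fields
("total_count", "max") are assumptions, as are the function name
clock_alerts and its warn_fraction parameter; verify them against the
actual output of your Kudu version before relying on this.

```python
from typing import Any, Dict, List

# Assumed metric names; check them against your server's metrics output.
EXTRAPOLATING_METRIC = "hybrid_clock_extrapolating"
MAX_ERRORS_METRIC = "hybrid_clock_max_errors"


def clock_alerts(entities: List[Dict[str, Any]],
                 max_sync_error_usec: int,
                 warn_fraction: float = 0.5) -> List[str]:
    """Return warnings derived from a parsed metrics dump.

    entities: list of metric entities (assumed layout, see lead-in).
    max_sync_error_usec: the value configured via --max_clock_sync_error_usec.
    warn_fraction: hypothetical knob; warn once the worst observed error
        reaches this fraction of the configured limit.
    """
    alerts: List[str] = []
    for entity in entities:
        # Clock metrics are assumed to be attached to the server entity.
        if entity.get("type") != "server":
            continue
        metrics = {m["name"]: m for m in entity.get("metrics", [])}
        server = entity.get("id", "unknown")

        # Gauge: is the hybrid clock currently extrapolating?
        gauge = metrics.get(EXTRAPOLATING_METRIC)
        if gauge is not None and gauge.get("value"):
            alerts.append(f"{server}: hybrid clock is extrapolating")

        # Histogram: how large did the reported maximum error get?
        hist = metrics.get(MAX_ERRORS_METRIC)
        if hist is not None and hist.get("total_count", 0) > 0:
            worst = hist.get("max", 0)
            if worst >= warn_fraction * max_sync_error_usec:
                alerts.append(
                    f"{server}: max clock error {worst} usec is close to "
                    f"the {max_sync_error_usec} usec limit"
                )
    return alerts
```

The function takes already-parsed JSON rather than fetching it, so it can
be fed from whatever collection mechanism a deployment already uses. An
empty result means neither condition was observed in the supplied dump.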



