[
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ariel Weisberg updated CASSANDRA-8732:
--------------------------------------
Summary: Make inter-node timeouts tolerate clock skew and drift (was: Make
inter-node timeouts tolerate time skew)
> Make inter-node timeouts tolerate clock skew and drift
> ------------------------------------------------------
>
> Key: CASSANDRA-8732
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8732
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Ariel Weisberg
>
> Right now internode timeouts rely on currentTimeMillis() (and NTP) to make
> sure that tasks don't expire before they arrive.
> Every receiver needs to deduce the offset between its own nanoTime and the
> remote node's nanoTime. I don't think currentTimeMillis is a good choice
> because it is designed to be manipulated by operators and NTP. I would
> probably be comfortable assuming that nanoTime isn't going to move in
> significant ways without something happening that could be classified as
> operator error.
> I suspect the one timing source you can rely on to be accurate is nanoTime
> within a node (on average), and that a node can report its own scheduling
> jitter (on average).
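> As a rough illustration of the jitter reporting (not a proposed
> implementation; the class and constants below are made up), a node could
> estimate its own average scheduling jitter by repeatedly sleeping for a fixed
> interval and measuring, in nanoTime, how late it actually wakes up:
> {code:java}
> import java.util.concurrent.TimeUnit;
>
> // Hypothetical sketch only: estimates how much later than requested this JVM
> // wakes up from a fixed-interval sleep, as a proxy for scheduling jitter.
> public final class SchedulingJitterEstimator
> {
>     private static final long INTERVAL_NANOS = TimeUnit.MILLISECONDS.toNanos(100);
>     private static final int SAMPLES = 50;
>
>     public static long averageJitterNanos() throws InterruptedException
>     {
>         long totalJitter = 0;
>         for (int i = 0; i < SAMPLES; i++)
>         {
>             long start = System.nanoTime();
>             TimeUnit.NANOSECONDS.sleep(INTERVAL_NANOS);
>             long actual = System.nanoTime() - start;
>             // How much later than the requested interval did we wake up?
>             totalJitter += Math.max(0, actual - INTERVAL_NANOS);
>         }
>         return totalJitter / SAMPLES;
>     }
> }
> {code}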
> Finding the offset requires knowing what the network latency is in one
> direction.
> One way to do this would be to periodically send a ping request which
> generates a series of ping responses at fixed intervals (maybe over UDP?).
> The responses should be corrected for scheduling jitter, since the fixed
> intervals may not be exactly achieved by the sender. By measuring the time
> deviation between ping responses and their expected arrival times (based on
> the interval) and correcting for the remotely reported scheduling jitter, you
> should be able to measure latency in one direction.
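> A minimal sketch of the receiver-side arithmetic described above, assuming
> the receiver records each response's arrival on its local nanoTime clock,
> knows the sender's nominal interval, and is told the sender's average
> scheduling jitter (all names here are hypothetical):
> {code:java}
> // Hypothetical sketch: average deviation of ping response arrivals from
> // their expected, interval-based positions, corrected for the scheduling
> // jitter the sender reported for itself.
> public final class PingDeviation
> {
>     public static long averageDeviationNanos(long[] arrivalNanos,
>                                              long intervalNanos,
>                                              long remoteJitterNanos)
>     {
>         // Expected arrival of response i is the first arrival plus i intervals
>         long base = arrivalNanos[0];
>         long totalDeviation = 0;
>         for (int i = 1; i < arrivalNanos.length; i++)
>         {
>             long expected = base + i * intervalNanos;
>             totalDeviation += arrivalNanos[i] - expected;
>         }
>         long averageDeviation = totalDeviation / (arrivalNanos.length - 1);
>         // Remove the jitter the sender says it introduced on its side
>         return averageDeviation - remoteJitterNanos;
>     }
> }
> {code}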
> A weighted moving average of these measurements (this only corrects for
> drift, not for a readjustment) would eventually converge on a close answer
> and would not be unduly impacted by outlier measurements. It may also make
> sense to drop the largest N samples to improve accuracy.
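> Something like the following could implement the smoothing step; the alpha
> value and the drop-the-largest-N handling are illustrative assumptions rather
> than a concrete proposal:
> {code:java}
> import java.util.Arrays;
>
> // Hypothetical sketch: exponentially weighted moving average of the samples,
> // discarding the largest N samples of each batch as likely outliers.
> public final class SmoothedEstimate
> {
>     private static final double ALPHA = 0.1; // weight given to each new sample
>     private double ewma = Double.NaN;
>
>     public void addSamples(long[] samples, int dropLargest)
>     {
>         long[] sorted = samples.clone();
>         Arrays.sort(sorted);
>         int keep = Math.max(1, sorted.length - dropLargest);
>         for (int i = 0; i < keep; i++)
>             ewma = Double.isNaN(ewma) ? sorted[i] : ALPHA * sorted[i] + (1 - ALPHA) * ewma;
>     }
>
>     public double currentEstimate()
>     {
>         return ewma;
>     }
> }
> {code}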
> Once you know the network latency you can add it to the timestamp of each
> ping, compare that to the local clock, and know what the offset is.
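> The final calculation would then be something like this (hypothetical names,
> nanoTime units throughout):
> {code:java}
> // Hypothetical sketch: offset to add to a remote nanoTime timestamp to
> // express it on the local nanoTime clock, given an estimated one-way latency.
> public final class ClockOffset
> {
>     public static long offsetNanos(long remoteSendNanos,
>                                    long estimatedOneWayLatencyNanos,
>                                    long localArrivalNanos)
>     {
>         long remoteNowAtArrival = remoteSendNanos + estimatedOneWayLatencyNanos;
>         return localArrivalNanos - remoteNowAtArrival;
>     }
> }
> {code}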
> These measurements won't calculate the offset to be too small (timeouts fire
> early), but they could calculate it to be too large (timeouts fire late). The
> conditions where the offset won't be accurate are also the conditions where
> you most want timeouts firing reliably. That, and bootstrapping under bad
> conditions, is what I am most uncertain about.