C. Scott Andreas updated CASSANDRA-8732:
    Component/s: Streaming and Messaging

> Make inter-node timeouts tolerate clock skew and drift
> ------------------------------------------------------
>                 Key: CASSANDRA-8732
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8732
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Streaming and Messaging
>            Reporter: Ariel Weisberg
>            Priority: Major
>         Attachments: maximalskew.png
> Right now internode timeouts rely on currentTimeMillis() (and NTP) to make 
> sure that tasks don't expire before they arrive.
> Every receiver needs to deduce the offset between its nanoTime and the remote 
> nanoTime. I don't think currentTimeMillis is a good choice because it is 
> designed to be manipulated by operators and NTP. I would probably be 
> comfortable assuming that nanoTime isn't going to move in significant ways 
> without something that could be classified as operator error happening.
> I suspect the one timing method you can rely on being accurate is nanoTime 
> within a node (on average) and that a node can report on its own scheduling 
> jitter (on average).
> Finding the offset requires knowing what the network latency is in one 
> direction.
> One way to do this would be to periodically send a ping request which 
> generates a series of ping responses at fixed intervals (maybe by UDP?). The 
> responses should be corrected for scheduling jitter since the fixed intervals 
> may not be exactly achieved by the sender. By measuring the time deviation 
> between ping responses and their expected arrival time (based on the 
> interval) and correcting for the remotely reported scheduling jitter, you 
> should be able to measure latency in one direction.
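The deviation calculation described above might look roughly like the following sketch. The class and parameter names are hypothetical, not Cassandra code. It assumes the sender reports, for each response, how late it was sent relative to its schedule, and it anchors the expected arrival times on the first response. Under that assumption the result is each response's delay relative to the first one; turning that into an absolute one-way latency would need further assumptions that the ticket leaves open.

```java
/** Hypothetical helper for illustration only; not part of Cassandra. */
final class PingDeviation {
    /**
     * @param arrivalNanos      local nanoTime at which each ping response arrived
     * @param senderJitterNanos lateness the sender reported for each response
     *                          (actual send time minus scheduled send time)
     * @param intervalNanos     fixed interval at which responses were scheduled
     * @return per-response deviation from the expected arrival time, with the
     *         sender's scheduling jitter subtracted out
     */
    static long[] deviationsNanos(long[] arrivalNanos, long[] senderJitterNanos, long intervalNanos) {
        if (arrivalNanos.length != senderJitterNanos.length) {
            throw new IllegalArgumentException("one jitter value is required per arrival");
        }
        long[] deviations = new long[arrivalNanos.length];
        if (arrivalNanos.length == 0) {
            return deviations;
        }
        // Assumption: the first response, corrected for its own jitter, defines the schedule.
        long base = arrivalNanos[0] - senderJitterNanos[0];
        for (int i = 0; i < arrivalNanos.length; i++) {
            long expected = base + i * intervalNanos;
            deviations[i] = (arrivalNanos[i] - senderJitterNanos[i]) - expected;
        }
        return deviations;
    }
}
```

For example, with a 1000 ns interval, arrivals at 10000, 11050 and 12300 ns and reported jitter of 0, 20 and 100 ns, the deviations are 0, 30 and 200 ns.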
> A weighted moving average (only correct for drift, not readjustment) of these 
> measurements would eventually converge on a close answer and would not be 
> impacted by outlier measurements. It may also make sense to drop the largest 
> N samples to improve accuracy.
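One possible reading of the smoothing step, as a minimal sketch: each batch of latency samples is trimmed by dropping its largest N values, and the batch mean is folded into an exponentially weighted moving average. The exponential weighting, the `alpha` parameter and the class name are assumptions made for illustration; the ticket only says "weighted moving average".

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not part of Cassandra. */
final class SmoothedLatency {
    private final double alpha;                  // weight given to the newest batch, in (0, 1]
    private double estimateNanos = Double.NaN;   // NaN until the first batch arrives

    SmoothedLatency(double alpha) {
        if (!(alpha > 0.0 && alpha <= 1.0)) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        this.alpha = alpha;
    }

    /** Mean of the samples after discarding the dropLargest biggest values. */
    static double trimmedMean(long[] samplesNanos, int dropLargest) {
        if (dropLargest < 0 || dropLargest >= samplesNanos.length) {
            throw new IllegalArgumentException("dropLargest must leave at least one sample");
        }
        long[] sorted = samplesNanos.clone();    // do not reorder the caller's array
        Arrays.sort(sorted);
        int keep = sorted.length - dropLargest;
        double sum = 0.0;
        for (int i = 0; i < keep; i++) {
            sum += sorted[i];
        }
        return sum / keep;
    }

    /** Folds one batch of samples into the running estimate and returns it. */
    double update(long[] batchNanos, int dropLargest) {
        double batchMean = trimmedMean(batchNanos, dropLargest);
        estimateNanos = Double.isNaN(estimateNanos)
                ? batchMean
                : alpha * batchMean + (1.0 - alpha) * estimateNanos;
        return estimateNanos;
    }
}
```

A small `alpha` makes the estimate follow slow drift while reacting little to any single batch, which seems to match the intent of correcting for drift rather than sudden readjustment.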
> Once you know network latency you can add that to the timestamp of each ping 
> and compare to the local clock and know what the offset is.
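The offset arithmetic in that step could be sketched as follows. All names are hypothetical, and the sketch assumes both nodes stamp messages with `System.nanoTime()` and that a one-way latency estimate is already available.

```java
/** Hypothetical helper for illustration only; not part of Cassandra. */
final class ClockOffset {
    /**
     * Offset to add to a remote nanoTime value to express it on the local nanoTime scale:
     * the ping left the remote node at remoteSendNanos, spent roughly oneWayLatencyNanos
     * on the wire, and arrived when the local clock read localReceiveNanos.
     */
    static long estimateOffsetNanos(long remoteSendNanos, long oneWayLatencyNanos, long localReceiveNanos) {
        return localReceiveNanos - (remoteSendNanos + oneWayLatencyNanos);
    }

    /** Translates a remote nanoTime value onto the local nanoTime scale. */
    static long toLocalNanos(long remoteNanos, long offsetNanos) {
        return remoteNanos + offsetNanos;
    }

    /** True once the local clock has reached the translated remote deadline. */
    static boolean isExpired(long remoteDeadlineNanos, long offsetNanos, long localNowNanos) {
        // Compare by subtraction, as recommended for nanoTime values, to tolerate overflow.
        return localNowNanos - toLocalNanos(remoteDeadlineNanos, offsetNanos) >= 0;
    }
}
```

For example, a ping stamped 1000 ns on the sender, with an estimated latency of 200 ns and a local arrival time of 5000 ns, gives an offset of 3800 ns, so a remote deadline of 2000 ns corresponds to 5800 ns locally.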
> These measurements won't calculate the offset to be too small (timeouts fire 
> early), but could calculate the offset to be too large (timeouts fire late). 
> The conditions where the offset won't be accurate are the conditions 
> where you also want timeouts firing reliably. This and bootstrapping in bad 
> conditions are what I am most uncertain of.
