[ 
https://issues.apache.org/jira/browse/CASSANDRA-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13780267#comment-13780267
 ] 

Christopher Smith commented on CASSANDRA-6106:
----------------------------------------------

Regarding resyncing the microsOffset: the only real problem arises if the 
clock skew grows larger than the "adjust" window of NTP or whatever other 
clock adjustment happens in the cluster (let's call that "sigma"). As a client, 
based solely on the contracts exposed by Cassandra's documentation, if I want a 
write to win out over a previous write X, I have a few options:

1) Use client generated timestamps both for write X and my current write, and 
ensure by other means my write has a higher timestamp value than all others.
2) Make sure my write happens more than the cluster's "sigma" after X.
3) Send all my writes for the cells in question to a specific node 
(which potentially means cross-data center writes).
4) After write X, for all columns I want to write to, I do a "select 
writetime(columnA), writetime(columnB) from ..." with consistency level high 
enough to exceed replication factor when added to write X's consistency level 
(for multi-datacenter clusters, that pretty much means both operations are 
QUORUM, or the select is done with consistency ALL), and then do a write with a 
client timestamp value that is greater than all of the timestamps I get back.
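
As a rough illustration of options 1 and 4, here is a minimal sketch of how a 
client might pick its own timestamp. ClientTimestamps and next() are 
hypothetical names, not part of Cassandra or any driver; the observed 
writetimes are assumed to come from a "select writetime(...)" like the one 
described above, and the result would be passed along with the write as its 
client timestamp.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper for illustration; not part of Cassandra or any driver. */
final class ClientTimestamps {
    // Last timestamp handed out by this instance, in microseconds.
    private final AtomicLong last = new AtomicLong(Long.MIN_VALUE);

    /**
     * Returns a timestamp (in microseconds) that is strictly greater than
     * every observed writetime and every value previously returned by this
     * instance, and no smaller than the supplied wall-clock reading.
     */
    long next(long wallClockMicros, long... observedWriteTimes) {
        long floor = wallClockMicros;
        for (long t : observedWriteTimes) {
            // Must beat each writetime read back from the cluster.
            floor = Math.max(floor, t + 1);
        }
        final long candidate = floor;
        // Also stay strictly monotonic across calls from this client.
        return last.updateAndGet(prev -> Math.max(prev + 1, candidate));
    }
}
```

This only orders writes issued through a single instance; ordering across 
several clients still has to be ensured by other means, as in option 1.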

Guarantees about ordering of writes are hard to come by, and if you don't use 
client generated timestamps, you basically should assume that two writes could 
resolve either way unless they are more than "sigma" apart, and "sigma" is 
usually a big enough number that at that point it is mostly irrelevant.

The tl;dr is: unless you are explicit about ordering client side, you can't be 
certain about the ordering of writes. All you can be certain of is atomicity 
(once this bug is fixed ;-).

So, to a degree, clock skew isn't terribly important (clients just can't count 
on how server time stamps will end up resolving the order of writes), but if 
we're going to adjust it, there probably needs to be a user adjustable value 
for "sigma" that controls how often the clock gets resync'd.

Does that sound reasonable?
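
To make the "sigma" idea concrete, here is one possible reading of it as a 
sketch. This is not the attached patch and not Cassandra's code: 
ResyncingMicrosClock, its constructor parameters, and the choice to resync 
when drift exceeds sigma (rather than on a fixed period) are all assumptions 
for illustration. The two clock sources are injected so the behaviour can be 
exercised without waiting on the real clock; in practice they would be 
System::currentTimeMillis and System::nanoTime.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch; not the attached patch and not Cassandra's code. */
final class ResyncingMicrosClock {
    private final LongSupplier wallMillis; // e.g. System::currentTimeMillis
    private final LongSupplier monoNanos;  // e.g. System::nanoTime
    private final long sigmaMicros;        // user-adjustable drift tolerance
    private long microsOffset;             // wall-clock micros minus monotonic micros

    ResyncingMicrosClock(LongSupplier wallMillis, LongSupplier monoNanos, long sigmaMicros) {
        this.wallMillis = wallMillis;
        this.monoNanos = monoNanos;
        this.sigmaMicros = sigmaMicros;
        this.microsOffset = wallMillis.getAsLong() * 1000 - monoNanos.getAsLong() / 1000;
    }

    /** Wall-clock time in microseconds, with sub-millisecond digits from the monotonic clock. */
    synchronized long timestampMicros() {
        long wall = wallMillis.getAsLong() * 1000;
        long mono = monoNanos.getAsLong() / 1000;
        long estimate = mono + microsOffset;
        // The wall clock is truncated to milliseconds, so up to 999 micros of
        // difference is expected; sigma should be comfortably larger than that.
        if (Math.abs(estimate - wall) > sigmaMicros) {
            // Drift exceeded sigma (e.g. after an NTP step): re-anchor the offset.
            microsOffset = wall - mono;
            estimate = wall;
        }
        return estimate;
    }
}
```

Note that a resync can move the returned value backwards if the wall clock was 
stepped back, so this does nothing for ordering guarantees; it only bounds how 
far the server-side timestamps stray from the wall clock.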
                
> QueryState.getTimestamp() & FBUtilities.timestampMicros() reads current 
> timestamp with System.currentTimeMillis() * 1000 instead of System.nanoTime() 
> / 1000
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6106
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6106
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: DSE Cassandra 3.1, but also HEAD
>            Reporter: Christopher Smith
>            Priority: Minor
>              Labels: collision, conflict, timestamp
>         Attachments: microtimstamp.patch
>
>
> I noticed this blog post: http://aphyr.com/posts/294-call-me-maybe-cassandra 
> mentioned issues with millisecond rounding in timestamps and was able to 
> reproduce the issue. If I specify a timestamp in a mutating query, I get 
> microsecond precision, but if I don't, I get timestamps rounded to the 
> nearest millisecond, at least for my first query on a given connection, which 
> substantially increases the possibilities of collision.
> I believe I found the offending code, though I am by no means sure this is 
> comprehensive. I think we probably need a fairly comprehensive replacement of 
> all uses of System.currentTimeMillis() with System.nanoTime().

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
