[ 
https://issues.apache.org/jira/browse/CASSANDRA-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13779623#comment-13779623
 ] 

Christopher Smith commented on CASSANDRA-6106:
----------------------------------------------

Look at the above description and also look at the article. LWT doesn't fix 
this. You could use a vector clock, but then you have all the hell that comes 
with that.

I agree the "still possible" is really dumb and a violation of the guarantees 
that Cassandra documents. As long as Cassandra has this mechanism though, we 
should make the probabilities way, way lower. With this change the probability 
of a collision drops to roughly the same odds as a UUID collision, which I 
think for practical purposes is "good enough".
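
To make the difference concrete, here is a minimal sketch. It is not the 
attached patch; the class and method names are made up for illustration. It 
assumes a microsecond timestamp is built by capturing the wall clock once and 
adding the elapsed System.nanoTime() delta, since nanoTime() on its own has an 
arbitrary origin and is not an epoch time.

```java
// Sketch only: contrasts a millisecond-rounded timestamp with an anchored
// microsecond one. Names are hypothetical, not Cassandra's.
final class MicrosClockSketch {
    // Wall-clock anchor captured once. System.nanoTime() has an arbitrary
    // origin, so it is only used to measure time elapsed since this anchor.
    private static final long BASE_WALL_MICROS = System.currentTimeMillis() * 1000;
    private static final long BASE_NANOS = System.nanoTime();

    /** The behaviour described in this issue: always a multiple of 1000. */
    static long millisRoundedMicros() {
        return System.currentTimeMillis() * 1000;
    }

    /** Assumed alternative: wall-clock anchor plus elapsed nanoTime, in micros. */
    static long anchoredMicros() {
        return BASE_WALL_MICROS + (System.nanoTime() - BASE_NANOS) / 1000;
    }

    public static void main(String[] args) {
        System.out.println("rounded:  " + millisRoundedMicros());
        System.out.println("anchored: " + anchoredMicros());
    }
}
```

One caveat with this sketch: the anchored value will not follow later 
wall-clock adjustments (NTP steps, for example), so a real implementation 
would have to decide how to re-anchor.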

Note that the current "+1" trick can also order writes backwards: if you write 
2 times in one millisecond to node A and once in the same millisecond to node 
B, the second write to node A is treated as having been last, even if it 
happened 999 microseconds before the write to node B.
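
A small sketch of that scenario, assuming the "+1" trick works roughly like 
the hypothetical NodeClock below (round to the millisecond, then bump by one 
if the result is not greater than the last timestamp handed out). This is my 
reading of the behaviour, not code lifted from Cassandra.

```java
// Sketch only: NodeClock is a hypothetical stand-in for a per-node
// timestamp generator that rounds to milliseconds and applies "+1".
final class PlusOneSketch {
    static final class NodeClock {
        private long last = Long.MIN_VALUE;

        /** Takes the true time in micros, returns the timestamp assigned. */
        long next(long wallClockMicros) {
            long rounded = (wallClockMicros / 1000) * 1000; // drop sub-millisecond part
            last = (rounded > last) ? rounded : last + 1;   // the "+1" trick
            return last;
        }
    }

    public static void main(String[] args) {
        long t0 = 1_380_000_000_000_000L; // arbitrary millisecond boundary, in micros
        NodeClock a = new NodeClock();
        NodeClock b = new NodeClock();

        long a1 = a.next(t0);       // first write to node A
        long a2 = a.next(t0);       // second write to node A, same instant
        long b1 = b.next(t0 + 999); // write to node B, 999 microseconds later

        System.out.println("A#2 ordered after B: " + (a2 > b1));
        System.out.println("A#1 ties with B:     " + (a1 == b1));
    }
}
```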

Cassandra should use a different mechanism to resolve concurrent writes with 
the same timestamp. I would propose something more like this:

If two nodes have different values for a cell, but have the same timestamp for 
the cell:

1) Compute the "token" for the record.
2) Compute replicas 1 to N for that token in each datacenter, and assign each 
replica node its rank, 1 to N.
3) The win goes to the value held by the replica node with the highest rank 
from #2.
4) If two datacenters each have a node with the same highest rank (note that 
ranking this way favours datacenters with higher replication factors, which 
seems... good to me), resolve in favour of the datacenter whose name 
alphasorts lowest.
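
Steps 3 and 4 can be sketched as a simple preference order. Everything here 
is hypothetical: the Candidate type, its fields and the resolve() helper are 
illustration only, and the sketch assumes the token and replica ranks from 
steps 1 and 2 have already been computed.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: picks a winner among values that share the same timestamp.
final class TieBreakSketch {
    /** One conflicting value, tagged with where it was found (hypothetical type). */
    static final class Candidate {
        final String datacenter;
        final int replicaRank; // 1..N within its datacenter, from step 2
        final String value;

        Candidate(String datacenter, int replicaRank, String value) {
            this.datacenter = datacenter;
            this.replicaRank = replicaRank;
            this.value = value;
        }
    }

    /** Positive when a is preferred over b. */
    static int preference(Candidate a, Candidate b) {
        if (a.replicaRank != b.replicaRank) {
            return Integer.compare(a.replicaRank, b.replicaRank); // step 3: highest rank wins
        }
        return b.datacenter.compareTo(a.datacenter); // step 4: lowest name wins
    }

    static Candidate resolve(List<Candidate> tied) {
        Candidate best = tied.get(0);
        for (Candidate c : tied) {
            if (preference(c, best) > 0) {
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Candidate> tied = Arrays.asList(
            new Candidate("dc2", 3, "x=8"),
            new Candidate("dc1", 3, "x=-8"),
            new Candidate("dc1", 1, "x=8"));
        System.out.println(resolve(tied).value);
    }
}
```

The point of ordering it this way is that every node evaluating the same set 
of tied values reaches the same answer without looking at the values 
themselves.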

                
> QueryState.getTimestamp() & FBUtilities.timestampMicros() reads current 
> timestamp with System.currentTimeMillis() * 1000 instead of System.nanoTime() 
> / 1000
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6106
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6106
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: DSE Cassandra 3.1, but also HEAD
>            Reporter: Christopher Smith
>            Priority: Minor
>              Labels: collision, conflict, timestamp
>         Attachments: microtimstamp.patch
>
>
> I noticed this blog post: http://aphyr.com/posts/294-call-me-maybe-cassandra 
> mentioned issues with millisecond rounding in timestamps and was able to 
> reproduce the issue. If I specify a timestamp in a mutating query, I get 
> microsecond precision, but if I don't, I get timestamps rounded to the 
> nearest millisecond, at least for my first query on a given connection, which 
> substantially increases the probability of a collision.
> I believe I found the offending code, though I am by no means sure this is 
> comprehensive. I think we probably need a fairly comprehensive replacement of 
> all uses of System.currentTimeMillis() with System.nanoTime().
> There seems to be some confusion here, so I'd like to clarify: the purpose of 
> this patch is NOT to improve the precision of ordering guarantees for 
> concurrent writes to cells. The purpose of this patch is to reduce the 
> probability that concurrent writes to cells are deemed as having occurred at 
> *the same time*, which is when Cassandra violates its atomicity guarantee.
> To clarify the failure scenario: Cassandra promises that writes to the same 
> record are "atomic", so if you do something like:
> create table foo (
> i int PRIMARY KEY,
> x int,
> y int
> );
> and then send these two queries concurrently:
> insert into foo (i, x, y) values (1, 8, -8);
> insert into foo (i, x, y) values (1, -8, 8);
> you can't be quite sure which of the two writes will be the "last" one. If 
> you then do:
> select x, y from foo where i = 1;
> you don't know if x is "8" or "-8".
> you don't know if y is "-8" or "8".
> YOU DO KNOW: x + y will equal 0.
> EXCEPT... if the timestamps assigned to the two queries are *exactly* the 
> same, in which case each cell is resolved on its own and you can end up with 
> x = 8 and y = 8, so x + y = 16. :-( Now your writes are not atomic.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
