[ 
https://issues.apache.org/jira/browse/CASSANDRA-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13779804#comment-13779804
 ] 

Sylvain Lebresne commented on CASSANDRA-6106:
---------------------------------------------

I agree it would be nice to ensure that, on a timestamp tie between 2 user 
updates, it's always the cells of the same user update that win.

I think we all agree however that better-resolution timestamps do not solve 
it, they just make timestamp ties less likely.

So I suppose there are 2 "questions":
# can we actually fix the real problem?
# given that even if we do find an acceptable way to fix it, it's unlikely to 
be a simple, minor-release kind of change, what about improving the timestamp 
resolution in the meantime?

The 2nd question is easier to answer. My opinion is that if we have a simple 
way to improve the resolution with no particular downside, then why not.  But 
concerning the attached patch, the one thing I'm not totally sure about is that 
it seems to "freeze" clock drift at the JVM startup, even if said clock drift 
is fixed by ntpd afterwards. Though I'm assuming here that nanoTime is not 
affected by ntpd adjusting the system clock which may be a wrong assumption in 
the first place, I haven't really checked tbh. And of course, in a healthy 
environment, one would hope that ntpd makes sure no system clock ever drifts 
enough on any node that it makes a meaningful difference to the application. 
Still, committing a patch that doesn't really fix a problem but only makes it 
less likely is only a no-brainer if there is no downside whatsoever, and I'm 
just not totally clear on the no-downside part.
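
To make the "freeze" concern concrete, here is a minimal sketch of a 
microsecond clock that reads the wall clock once and then advances only by 
nanoTime deltas. This is not the attached patch; AnchoredMicrosClock and its 
methods are hypothetical names used for illustration, and the nano source is 
injected only so the behaviour can be checked deterministically.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch: microsecond clock anchored to wall time once, at construction. */
final class AnchoredMicrosClock {
    private final long anchorWallMicros;   // wall-clock reading taken once
    private final long anchorNanos;        // nano-source reading taken at the same moment
    private final LongSupplier nanoSource;

    AnchoredMicrosClock(long wallMillisAtStart, LongSupplier nanoSource) {
        this.anchorWallMicros = wallMillisAtStart * 1000;
        this.nanoSource = nanoSource;
        this.anchorNanos = nanoSource.getAsLong();
    }

    /** Convenience factory using the real JVM clocks. */
    static AnchoredMicrosClock system() {
        return new AnchoredMicrosClock(System.currentTimeMillis(), System::nanoTime);
    }

    /**
     * Wall time at the anchor plus elapsed nanos since the anchor.
     * The wall clock is never consulted again, so whatever error it had
     * when the anchor was taken is carried for the life of this object.
     */
    long nowMicros() {
        return anchorWallMicros + (nanoSource.getAsLong() - anchorNanos) / 1000;
    }
}
```

If the system clock was off when the anchor was taken, later corrections to 
it never reach nowMicros() in this sketch; whether the attached patch behaves 
the same way is exactly the open question above.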


Regarding question 1, what we need (if I'm not mistaken) is a way to break 
timestamp ties that ensures that given 2 user updates u1 and u2, either all 
cells from u1 win or they all lose when resolved against cells of u2.
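
The requirement can be illustrated with a small sketch of what goes wrong 
when ties are broken cell by cell. It assumes, purely for illustration, a 
per-cell tiebreak that keeps the greater value; TieMixDemo, Cell, reconcile 
and merge are hypothetical names, not Cassandra classes.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: a per-cell tiebreak can mix cells from two tied updates. */
final class TieMixDemo {
    static final class Cell {
        final long timestamp;
        final String value;
        Cell(long timestamp, String value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Higher timestamp wins; on a tie the greater value wins (illustrative assumption). */
    static Cell reconcile(Cell a, Cell b) {
        if (a.timestamp != b.timestamp)
            return a.timestamp > b.timestamp ? a : b;
        return a.value.compareTo(b.value) >= 0 ? a : b;
    }

    /** Merges two updates column by column, resolving each column independently. */
    static Map<String, Cell> merge(Map<String, Cell> u1, Map<String, Cell> u2) {
        Map<String, Cell> out = new TreeMap<>(u1);
        u2.forEach((column, cell) -> out.merge(column, cell, TieMixDemo::reconcile));
        return out;
    }
}
```

With u1 = {a: "1", b: "2"} and u2 = {a: "2", b: "1"} at the same timestamp, 
merge keeps a from u2 and b from u1, so the result matches neither update.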

For that, one suggestion could be to extend the cell timestamp to include an id 
of the coordinator of the update. A coordinator id is enough to break 
timestamp ties because it is relatively easy to ensure a given coordinator 
never issues two updates with the same timestamp (we ensure that only per 
client connection so far, but making it global to the coordinator is pretty 
easy). I "think" that may not be very far from Christopher's suggestion above, 
though I don't think we need to mess with datacenters and whatnot. Namely, now 
that we can do CAS operations, it should be easy to assign each host in the 
Cassandra cluster a unique id that is short (i.e. we can support 65K nodes 
with 2 bytes).
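
A rough sketch of that idea, under stated assumptions: every cell of an 
update carries the same (timestamp, coordinator id) pair, pairs are compared 
by timestamp first and coordinator id second, and each coordinator hands out 
strictly increasing timestamps. CoordinatorStamp and CoordinatorClock are 
hypothetical names; how ids would actually be assigned (e.g. via CAS) is not 
shown.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical (timestamp, coordinator id) pair shared by all cells of one update. */
final class CoordinatorStamp implements Comparable<CoordinatorStamp> {
    final long micros;
    final int coordinatorId; // assumed to be in 0..65535 so it fits in 2 bytes

    CoordinatorStamp(long micros, int coordinatorId) {
        if (coordinatorId < 0 || coordinatorId > 0xFFFF)
            throw new IllegalArgumentException("coordinator id must fit in 2 bytes");
        this.micros = micros;
        this.coordinatorId = coordinatorId;
    }

    /** Timestamp decides first; the coordinator id only breaks exact ties. */
    @Override
    public int compareTo(CoordinatorStamp other) {
        int byTime = Long.compare(micros, other.micros);
        return byTime != 0 ? byTime : Integer.compare(coordinatorId, other.coordinatorId);
    }
}

/** Hypothetical coordinator-wide generator that never returns the same timestamp twice. */
final class CoordinatorClock {
    private final int coordinatorId;
    private final AtomicLong last = new AtomicLong(Long.MIN_VALUE);

    CoordinatorClock(int coordinatorId) {
        this.coordinatorId = coordinatorId;
    }

    /** Returns the given wall time, bumped by at least 1 microsecond past the previous stamp if needed. */
    CoordinatorStamp next(long wallMicros) {
        long ts = last.updateAndGet(prev -> Math.max(prev + 1, wallMicros));
        return new CoordinatorStamp(ts, coordinatorId);
    }
}
```

Because two stamps from different coordinators never compare as equal, and 
two stamps from the same coordinator never share a timestamp, resolving u1 
against u2 gives the same answer for every cell: all of u1 wins or all of it 
loses.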

Of course it's not a totally trivial change, but I don't think it's really that 
complex to do either (I'm relatively confident I can get that done for 2.1 for 
instance if we decide to go for it). At first glance, the only downside I see 
would be that it'll add a 2-byte overhead per cell. But I'm not really sure 
this would have much measurable impact in practice (given compression in 
particular), and there are not-that-complex ways to compact the current 
timestamp more than we do now if we really want to gain back a few bytes.

                
> QueryState.getTimestamp() & FBUtilities.timestampMicros() reads current 
> timestamp with System.currentTimeMillis() * 1000 instead of System.nanoTime() 
> / 1000
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6106
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6106
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: DSE Cassandra 3.1, but also HEAD
>            Reporter: Christopher Smith
>            Priority: Minor
>              Labels: collision, conflict, timestamp
>         Attachments: microtimstamp.patch
>
>
> I noticed this blog post: http://aphyr.com/posts/294-call-me-maybe-cassandra 
> mentioned issues with millisecond rounding in timestamps and was able to 
> reproduce the issue. If I specify a timestamp in a mutating query, I get 
> microsecond precision, but if I don't, I get timestamps rounded to the 
> nearest millisecond, at least for my first query on a given connection, which 
> substantially increases the possibilities of collision.
> I believe I found the offending code, though I am by no means sure this is 
> comprehensive. I think we probably need a fairly comprehensive replacement of 
> all uses of System.currentTimeMillis() with System.nanoTime().

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
