[ https://issues.apache.org/jira/browse/CASSANDRA-6106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13779804#comment-13779804 ]
Sylvain Lebresne commented on CASSANDRA-6106:
---------------------------------------------

I agree it would be nice to ensure that it's always the cells of the same user update that win on a timestamp tie between 2 user updates. I think we all agree, however, that better-resolution timestamps do not solve it; they just make timestamp ties less likely. So I suppose there are 2 "questions":
# can we actually fix the real problem?
# given that even if we do find an acceptable way to fix it, it's unlikely to be a simple, minor-release kind of change, what about improving the timestamp resolution in the meantime?

The 2nd question is easier to answer. My opinion is that if we have a simple way to improve the resolution with no particular downside, then why not. But concerning the attached patch, the one thing I'm not totally sure about is that it seems to "freeze" clock drift at JVM startup, even if said clock drift is fixed by ntpd afterwards. Though I'm assuming here that nanoTime is not affected by ntpd adjusting the system clock, which may be a wrong assumption in the first place; I haven't really checked, tbh. And of course, in a healthy environment, one would hope that ntpd makes sure no system clock ever drifts enough on any node that it makes a meaningful difference to the application. Still, committing a patch that doesn't really fix a problem but only makes it less likely is only a no-brainer if there is no downside whatsoever, and I'm just not totally clear on the no-downside part.

Regarding question 1, what we need (if I'm not mistaken) is a way to break timestamp ties that ensures that, given 2 user updates u1 and u2, either all cells from u1 win or they all lose when resolved against cells of u2. For that, one suggestion could be to extend the cell timestamp to include an id of the coordinator of the update.
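
To make the tie-breaking idea concrete, here is a minimal sketch, not existing Cassandra code and not the attached patch. The names `TieBreakSketch`, `Cell`, `coordinatorId` and `reconcile` are hypothetical, and the coordinator id is assumed to fit in an unsigned 16-bit value. Cells are ordered by timestamp first and by coordinator id second, so on a tie every cell of one update beats every cell of the other.

```java
import java.util.Comparator;

final class TieBreakSketch {
    /** Hypothetical cell stamp: microsecond timestamp plus a 2-byte coordinator id. */
    static final class Cell {
        final long timestampMicros;
        final int coordinatorId; // assumption: fits in an unsigned 16-bit value
        final String value;

        Cell(long timestampMicros, int coordinatorId, String value) {
            if (coordinatorId < 0 || coordinatorId > 0xFFFF)
                throw new IllegalArgumentException("coordinator id must fit in 2 bytes");
            this.timestampMicros = timestampMicros;
            this.coordinatorId = coordinatorId;
            this.value = value;
        }
    }

    // Order by timestamp first; on a tie, the larger coordinator id wins.
    static final Comparator<Cell> ORDER =
        Comparator.<Cell>comparingLong(c -> c.timestampMicros)
                  .thenComparingInt(c -> c.coordinatorId);

    /** Returns the cell that survives reconciliation of a and b. */
    static Cell reconcile(Cell a, Cell b) {
        return ORDER.compare(a, b) >= 0 ? a : b;
    }

    public static void main(String[] args) {
        long ts = 1380000000000000L; // both updates carry the same timestamp
        Cell[] u1 = { new Cell(ts, 1, "u1-x"), new Cell(ts, 1, "u1-y") };
        Cell[] u2 = { new Cell(ts, 2, "u2-x"), new Cell(ts, 2, "u2-y") };
        // Resolve cell by cell, as would happen column by column.
        for (int i = 0; i < u1.length; i++)
            System.out.println(reconcile(u1[i], u2[i]).value);
    }
}
```

With these inputs both surviving cells come from u2, because the winner of a tie depends only on the coordinator id and never on the individual cell. Which id wins is arbitrary in this sketch; what matters is that the choice is the same for every cell of the update.
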
It's enough to have a coordinator id to break timestamp ties because it is relatively easy to ensure a given coordinator never issues two updates with the same timestamp (we ensure that only per client connection so far, but making it global to the coordinator is pretty easy). I "think" that may not be very far from Christopher's suggestion above, though I don't think we need to mess with datacenters and whatnot. Namely, now that we can do CAS operations, it should be easy to assign each host in the Cassandra cluster a unique id that is short (i.e. we can support 65K nodes with 2 bytes).

Of course it's not a totally trivial change, but I don't think it's really that complex to do either (I'm relatively confident I can get that done for 2.1, for instance, if we decide to go for it). At first glance, the only downside I see would be that it'll add a 2-byte overhead per cell. But I'm not really sure this would have much measurable impact in practice (given compression in particular), and there are not-that-complex ways to compact the current timestamp more than we do now if we really want to gain back a few bytes.

> QueryState.getTimestamp() & FBUtilities.timestampMicros() reads current
> timestamp with System.currentTimeMillis() * 1000 instead of System.nanoTime()
> / 1000
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-6106
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6106
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: DSE Cassandra 3.1, but also HEAD
>            Reporter: Christopher Smith
>            Priority: Minor
>              Labels: collision, conflict, timestamp
>         Attachments: microtimstamp.patch
>
>
> I noticed that this blog post: http://aphyr.com/posts/294-call-me-maybe-cassandra
> mentioned issues with millisecond rounding in timestamps, and I was able to
> reproduce the issue.
> If I specify a timestamp in a mutating query, I get
> microsecond precision, but if I don't, I get timestamps rounded to the
> nearest millisecond, at least for my first query on a given connection, which
> substantially increases the possibility of collisions.
> I believe I found the offending code, though I am by no means sure this is
> comprehensive. I think we probably need a fairly comprehensive replacement of
> all uses of System.currentTimeMillis() with System.nanoTime().

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira