[ https://issues.apache.org/jira/browse/CASSANDRA-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17472374#comment-17472374 ]
Jordan West edited comment on CASSANDRA-17251 at 1/11/22, 12:43 AM:
--------------------------------------------------------------------

https://github.com/apache/cassandra/compare/cassandra-3.0...jrwest:jwest/17251-3.0

was (Author: jrwest):
    https://github.com/apache/cassandra/compare/trunk...jrwest:jwest/17251-3.0

> USING writetime + ttl is non-idempotent leading to non-deterministic merge
> iteration results
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17251
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17251
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Other
>            Reporter: Jordan West
>            Assignee: Jordan West
>            Priority: Normal
>             Fix For: 3.0.x, 3.11.x, 4.0.x
>
>
> The combination of {{USING writetime = timestamp and ttl = ttl}} can result
> in non-deterministic MergeIterator results, causing DigestMismatchExceptions
> and increased latencies. The increased latencies are caused by additional
> round trips due to the digest mismatch, as well as by read repair rewriting
> the data. The additional writes increase the number of sstables the key is
> stored in, all of which must be scanned on read.
> The order of events is:
> 1. For a given partition, a write is performed with {{USING timestamp =
> sometime and ttl = ttl1}}.
> 2. Cassandra records this write with timestamp = sometime, ttl = ttl1,
> expires_at = now + ttl1.
> 3. After N seconds, for the same partition, another write is performed with
> {{USING timestamp = sometime and ttl = ttl2}}, where ttl2 = ttl1 - N. This
> write only makes it to a subset of replicas* for some reason (e.g. partial
> write, node down, etc).
> 4. Cassandra records this write with timestamp = sometime, ttl = ttl2,
> expires_at = now + ttl2. It's important to note that at this point,
> expires_at from step 2 above is equal to expires_at here.
> This is because expires_at is calculated relative to the current write
> time, not the provided timestamp, and the ttl has been reduced by exactly
> the time that passed. With t1 the wall-clock time of the first write:
> (t1 + N) + (ttl1 - N) = t1 + ttl1. This write also makes it to a subset of
> replicas*.
> 5. A read of the data is performed.
> 5a. The MergeIterator resolves conflicts locally (across sstables) using
> {{Conflicts.resolveRegular}} or {{Cells.resolveRegular}}. The resolution
> takes into account the write timestamp, the liveness of the cell, the values
> themselves, and how much time is left to live via the expires_at field. In
> this scenario, all of these fields are equal, leading to Cassandra picking
> the sstable "on the right", which is non-deterministic. The only item that
> differs is the ttl itself.
> 5b. One node returns the non-deterministically chosen value for the row,
> while the other two calculate and send a digest to the coordinator. The
> digest includes the relative ttl field, which may not match. This results in
> a DigestMismatchException at the coordinator.
> 6. Read repair is triggered.
> *NOTE: it's not strictly necessary for the write to make it to a subset of
> replicas. sstables can also be returned from the live set in arbitrary
> order, for reasons like compaction or repair, which can lead to the same
> behavior. This also affects repair from what we can tell.
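
The sequence of events described in the issue can be sketched as a small, self-contained Python model. This is an illustration only, not Cassandra code: the Cell class, write(), resolve(), and digest() are hypothetical names, the numeric values are made up, the comparison order inside resolve() is an assumption (it does not matter here because every compared field ties), and SHA-256 merely stands in for whatever hash the real digest uses. The sketch shows that the two writes get the same expires_at, that a resolver which falls back to "the one on the right" on a tie returns a different ttl depending on iteration order, and that a digest covering the ttl therefore differs.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical stand-in for a stored cell; not Cassandra's Cell class."""
    timestamp: int   # client-supplied write timestamp (USING timestamp)
    ttl: int         # relative ttl in seconds, as supplied by the client
    expires_at: int  # wall-clock write time + ttl, in seconds
    value: bytes
    live: bool = True


def write(timestamp: int, ttl: int, now: int, value: bytes) -> Cell:
    # expires_at is derived from the local write time, not from `timestamp`.
    return Cell(timestamp=timestamp, ttl=ttl, expires_at=now + ttl, value=value)


def resolve(left: Cell, right: Cell) -> Cell:
    """Simplified resolver: compares timestamp, liveness, value, expires_at.

    The ttl is deliberately NOT compared, mirroring the issue description.
    On a complete tie the cell "on the right" wins (assumed tie-break).
    """
    key_left = (left.timestamp, left.live, left.value, left.expires_at)
    key_right = (right.timestamp, right.live, right.value, right.expires_at)
    return left if key_left > key_right else right


def digest(cell: Cell) -> str:
    # The digest covers the relative ttl, as stated in step 5b.
    payload = repr((cell.timestamp, cell.ttl, cell.expires_at, cell.value))
    return hashlib.sha256(payload.encode()).hexdigest()


# Steps 1-2: first write at wall-clock time t1 with ttl1.
t1, ttl1, n = 1_000, 600, 100
first = write(timestamp=42, ttl=ttl1, now=t1, value=b"v")

# Steps 3-4: same timestamp, N seconds later, with ttl2 = ttl1 - N.
second = write(timestamp=42, ttl=ttl1 - n, now=t1 + n, value=b"v")

# Step 5a: the winner depends only on the order the cells are iterated in.
winner_order_a = resolve(first, second)
winner_order_b = resolve(second, first)

# Step 5b: replicas that iterated in different orders produce different digests.
digests_match = digest(winner_order_a) == digest(winner_order_b)
```

In this model, first and second share the same expires_at while their ttl values differ, so winner_order_a and winner_order_b carry different ttls and digests_match is False. Under these assumptions, a resolver would need some deterministic comparison for the remaining differing field to make the merge independent of iteration order; whether that is how the linked branch addresses the issue is not stated in this message.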