Jordan West created CASSANDRA-17251:
---------------------------------------

             Summary: USING writetime + ttl is non-idempotent leading to 
non-deterministic merge iteration results
                 Key: CASSANDRA-17251
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17251
             Project: Cassandra
          Issue Type: Bug
            Reporter: Jordan West
            Assignee: Jordan West


The combination of {{USING timestamp = writetime and ttl = ttl}} can result in 
non-deterministic MergeIterator results, causing DigestMismatchExceptions and 
increased latencies. The increased latencies are caused by additional round 
trips due to the digest mismatch, as well as by read repair rewriting the data. 
The additional writes increase the number of sstables the key is stored in, all 
of which must be scanned on read.

The order of events is:
1. For a given partition, a write is performed with {{USING timestamp = sometime 
and ttl = ttl1}}.
2. Cassandra records this write with timestamp = sometime, ttl = ttl1, 
expires_at = now + ttl1.
3. After N seconds, another write is performed for the same partition with 
{{USING timestamp = sometime and ttl = ttl2}}, where ttl2 = ttl1 - N. This write 
only makes it to a subset of replicas* for some reason (e.g. partial write, 
node down, etc.).
4. Cassandra records this write with timestamp = sometime, ttl = ttl2, 
expires_at = now + ttl2. It's important to note that at this point, expires_at 
in step 2 above is equal to expires_at here. This is because expires_at is 
calculated relative to the wall-clock time of the write, not the provided 
timestamp, and the ttl has been reduced by exactly the time that has passed: 
writing t for the wall-clock time of the first write, 
(t + N) + (ttl1 - N) = t + ttl1. This write also makes it to a subset of 
replicas*.
5. A read of the data is performed.
5a. The MergeIterator resolves conflicts locally (across sstables) using 
{{Conflicts.resolveRegular}} or {{Cells.resolveRegular}}. The resolution takes 
into account the write timestamp, the liveness of the cell, the values 
themselves, and how much time is left to live via the expires_at field. In this 
scenario, all of these fields are equal, leading to Cassandra picking the 
sstable "on the right"; this is non-deterministic. The only item that differs 
is the ttl itself.
5b. One node returns the non-deterministically chosen value for the row, the 
other two calculate and send a digest to the coordinator. The digest includes 
the relative ttl field which may not match. This results in a 
DigestMismatchException at the coordinator.
6. Read repair is triggered.
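
The sequence above can be sketched in a few lines of Python. This is a 
simplified model for illustration only, not Cassandra's code: the Cell class, 
write() and resolve() are hypothetical names, the concrete numbers are made up, 
and the comparison order inside resolve() is an assumption. Liveness is left 
out because both cells are live. The only properties taken from the 
description are that expires_at is derived from the wall-clock time of the 
write, and that a full tie falls through to the right-hand input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    timestamp: int   # client-supplied write timestamp (USING timestamp)
    ttl: int         # relative ttl in seconds (USING ttl)
    expires_at: int  # wall-clock time of the write + ttl
    value: bytes


def write(now: int, timestamp: int, ttl: int, value: bytes) -> Cell:
    """Model steps 2 and 4: expires_at is relative to `now`, not `timestamp`."""
    return Cell(timestamp, ttl, now + ttl, value)


def resolve(left: Cell, right: Cell) -> Cell:
    """Hypothetical stand-in for the conflict resolution in step 5a.

    The ttl is not part of the comparison, so cells that differ only in
    ttl tie, and the right-hand input wins.
    """
    left_key = (left.timestamp, left.value, left.expires_at)
    right_key = (right.timestamp, right.value, right.expires_at)
    return left if left_key > right_key else right


T = 1_000        # wall-clock time of the first write (made-up value)
N = 30           # seconds between the two writes
TTL1 = 100

first = write(now=T, timestamp=42, ttl=TTL1, value=b"v")
second = write(now=T + N, timestamp=42, ttl=TTL1 - N, value=b"v")

print(first.expires_at == second.expires_at)  # same expiry, different ttl
print(resolve(first, second).ttl)             # ttl of whichever cell was on the right
print(resolve(second, first).ttl)             # swapping the inputs swaps the winner
```

In this model, swapping the two arguments to resolve() changes which ttl 
survives the merge, which is the order dependence described in step 5a.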

*NOTE: it's not strictly necessary for the write to make it to only a subset of 
replicas. sstables can also be returned from the live set in differing orders, 
for reasons like compaction or repair, which can lead to the same behavior. 
From what we can tell, this also affects repair.
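
To illustrate the note, here is a minimal Python sketch of two replicas that 
hold the same two cells but merge their sstables in different orders. As 
before, this is a model and not Cassandra's implementation: Cell, resolve(), 
merge() and digest() are hypothetical helpers, the values are made up, and 
SHA-256 is used only as a convenient stand-in hash. The one property taken 
from the description is that the digest covers the relative ttl.

```python
import hashlib
from functools import reduce
from typing import NamedTuple, Sequence


class Cell(NamedTuple):
    timestamp: int
    ttl: int
    expires_at: int
    value: bytes


def resolve(left: Cell, right: Cell) -> Cell:
    # Hypothetical tie-break: ttl is not compared, the right-hand input wins ties.
    left_key = (left.timestamp, left.value, left.expires_at)
    right_key = (right.timestamp, right.value, right.expires_at)
    return left if left_key > right_key else right


def merge(sstables: Sequence[Cell]) -> Cell:
    # Fold the per-sstable cells pairwise, in the order they were handed over.
    return reduce(resolve, sstables)


def digest(cell: Cell) -> str:
    # Assumption: the digest input includes the relative ttl.
    payload = repr((cell.timestamp, cell.ttl, cell.value)).encode()
    return hashlib.sha256(payload).hexdigest()


a = Cell(timestamp=42, ttl=100, expires_at=1_100, value=b"v")
b = Cell(timestamp=42, ttl=70, expires_at=1_100, value=b"v")

replica_1 = merge([a, b])  # sstables returned in one order
replica_2 = merge([b, a])  # same data, opposite order

print(digest(replica_1) == digest(replica_2))
```

Both replicas hold identical data, yet in this model their digests differ 
because each merge kept a different ttl. That is the condition under which the 
coordinator would see a digest mismatch.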



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
