Jordan West created CASSANDRA-17251:
---------------------------------------
Summary: USING writetime + ttl is non-idempotent leading to
non-deterministic merge iteration results
Key: CASSANDRA-17251
URL: https://issues.apache.org/jira/browse/CASSANDRA-17251
Project: Cassandra
Issue Type: Bug
Reporter: Jordan West
Assignee: Jordan West
The combination of {{USING timestamp = sometime and ttl = ttl}} can result in
non-deterministic MergeIterator results causing DigestMismatchExceptions and
increased latencies. The increased latencies are caused by additional round
trips due to the digest mismatch as well as read repair rewriting the data. The
additional writes lead to an increase in the number of sstables the key is
stored in and must be scanned on read.
The order of events is:
1. for a given partition a write is performed with {{USING timestamp = sometime
and ttl = ttl1}}.
2. Cassandra records this write with timestamp = sometime, ttl = ttl1,
expires_at = now + ttl1
3. after N seconds, for the same partition, another write is performed with
{{USING timestamp = sometime and ttl = ttl2 where ttl2 = ttl1 - N}}. This write
only makes it to a subset of replicas* for some reason (e.g. partial write,
node down, etc).
4. Cassandra records this write with timestamp = sometime, ttl = ttl2,
expires_at = now + ttl2. It's important to note that at this point, expires_at
in step 2 above is equal to expires_at here. This is because it is calculated
relative to the current write time, not the provided timestamp, and the ttl has
been adjusted by the time passed. This write also makes it to a subset of
replicas*.
5. A read of the data is performed.
5a. The MergeIterator resolves conflicts locally (across sstables) using
{{Conflicts.resolveRegular}} or {{Cells.resolveRegular}}. The resolution takes
into account the write timestamp, the liveness of the cell, the values
themselves, and how much time is left to live via the expires_at field. In this
scenario, all of these fields are equal, leading to Cassandra picking the
sstable "on the right" – this is non-deterministic. The only item that differs
is the ttl itself.
5b. One node returns the non-deterministically chosen value for the row, the
other two calculate and send a digest to the coordinator. The digest includes
the relative ttl field which may not match. This results in a
DigestMismatchException at the coordinator.
6. Read repair is triggered
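To make steps 1-4 concrete, here is a minimal sketch of the expiry arithmetic. The {{Cell}} class, the {{write}} helper, and all numeric values are hypothetical and for illustration only; this is not Cassandra's code, it only mirrors the expires_at = now + ttl rule described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    """Hypothetical, simplified stand-in for a stored cell."""
    value: str
    timestamp: int   # client-supplied USING TIMESTAMP
    ttl: int         # client-supplied USING TTL, in seconds
    expires_at: int  # wall-clock second at which the cell expires


def write(value, timestamp, ttl, now):
    # Expiry is derived from the wall clock at write time,
    # not from the client-supplied timestamp.
    return Cell(value, timestamp, ttl, now + ttl)


T0 = 1_000_000   # wall-clock second of the first write (illustrative)
SOMETIME = 42    # the client-supplied timestamp, reused by both writes
TTL1 = 3600
N = 600          # seconds between the two writes

first = write("v", SOMETIME, TTL1, now=T0)           # steps 1-2
second = write("v", SOMETIME, TTL1 - N, now=T0 + N)  # steps 3-4

print(first.timestamp == second.timestamp)    # same write timestamp
print(first.expires_at == second.expires_at)  # same expiry
print(first.ttl == second.ttl)                # only the ttl differs
```

Because (T0 + N) + (TTL1 - N) = T0 + TTL1, the two cells agree on every field the resolution looks at and differ only in ttl.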
*NOTE: it's not strictly necessary for the write to make it to a subset of
replicas. sstables can also be returned from the live set in differing orders,
for reasons like compaction or repair, which can lead to the same behavior.
This also affects repair from what we can tell.
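The order dependence in 5a/5b can be sketched with a toy model. Everything below is an assumption made for illustration: {{resolve}} is not {{Cells.resolveRegular}}, the comparison order is invented, liveness is omitted because both cells are live, and {{digest}} only imitates the fact that the digest covers the ttl field.

```python
import hashlib
from collections import namedtuple

# Hypothetical simplified cell; not Cassandra's class.
Cell = namedtuple("Cell", "value timestamp ttl expires_at")


def resolve(left, right):
    """Toy reconciliation: compare timestamp, then expiry, then value.

    On a full tie the right-hand cell wins. The ttl is never consulted.
    """
    for key in (lambda c: c.timestamp,
                lambda c: c.expires_at,
                lambda c: c.value):
        if key(left) != key(right):
            return left if key(left) > key(right) else right
    return right


def digest(cell):
    # The digest covers every field, including the ttl.
    return hashlib.md5(repr(tuple(cell)).encode()).hexdigest()


# Two cells equal in everything the resolution inspects, differing only in ttl.
a = Cell("v", 42, 3600, 1_003_600)
b = Cell("v", 42, 3000, 1_003_600)

replica_1 = resolve(a, b)  # sstables iterated in one order
replica_2 = resolve(b, a)  # same data, opposite order

print(replica_1.ttl, replica_2.ttl)
print(digest(replica_1) == digest(replica_2))
```

In this sketch the winner depends only on iteration order, so two replicas holding identical data can still produce different digests.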
--
This message was sent by Atlassian Jira
(v8.20.1#820001)