[
https://issues.apache.org/jira/browse/CASSANDRA-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718825#comment-17718825
]
Stefan Miklosovic commented on CASSANDRA-17251:
-----------------------------------------------
Would you please prepare patches for all branches up to trunk, [~jwest]? (If
this issue happens there as well.)
> USING writetime + ttl is non-idempotent leading to non-deterministic merge
> iteration results
> --------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-17251
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17251
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Other
> Reporter: Jordan West
> Assignee: Jordan West
> Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0.x
>
>
> The combination of {{USING writetime = timestamp and ttl = ttl}} can result
> in non-deterministic MergeIterator results causing DigestMismatchExceptions
> and increased latencies. The increased latencies are caused by additional
> round trips due to the digest mismatch as well as read repair rewriting the
> data. The additional writes increase the number of sstables the key is stored
> in, all of which must be scanned on read.
> The order of events is:
> 1. for a given partition a write is performed with {{USING timestamp =
> sometime and ttl = ttl1}}.
> 2. Cassandra records this write with timestamp = sometime, ttl = ttl1,
> expires_at = now + ttl1
> 3. after N seconds, for the same partition, another write is performed with
> {{USING timestamp = sometime and ttl = ttl2 where ttl2 = ttl1 - N}}. This
> write only makes it to a subset of replicas* for some reason (e.g. partial
> write, node down, etc).
> 4. Cassandra records this write with timestamp = sometime, ttl = ttl2,
> expires_at = now + ttl2. It's important to note that at this point, expires_at
> in step 2 above is equal to expires_at here. This is because it is calculated
> relative to the current write time, not the provided timestamp, and the ttl
> has been adjusted by the time passed. This write also makes it to a subset of
> replicas*.
> 5. A read of the data is performed.
> 5a. The MergeIterator resolves conflicts locally (across sstables) using
> {{Conflicts.resolveRegular}} or {{Cells.resolveRegular}}. The resolution
> takes into account the write timestamp, the liveness of the cell, the values
> themselves, and how much time is left to live via the expires_at field. In
> this scenario, all of these fields are equal, leading to Cassandra picking
> the sstable "on the right" – this is non-deterministic. The only item that
> differs is the ttl itself.
> 5b. One node returns the non-deterministically chosen value for the row, the
> other two calculate and send a digest to the coordinator. The digest includes
> the relative ttl field which may not match. This results in a
> DigestMismatchException at the coordinator.
> 6. Read repair is triggered
> *NOTE: it's not strictly necessary for the write to make it to a subset of
> replicas. sstables can also be returned from the live set in arbitrary order,
> for reasons like compaction or repair, which can lead to the same behavior.
> This also affects repair from what we can tell.
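
The sequence of events in the description can be sketched as a small, self-contained Python model. This is only an illustration under stated assumptions, not Cassandra's actual code: the `Cell` class and the `write`, `resolve`, and `digest` helpers are hypothetical names, the comparison order inside `resolve` is assumed, and the liveness check is omitted because both cells are live, expiring cells. It shows that the two writes end up with the same expires_at, that a full tie makes the winner depend on the order of the inputs, and that a digest covering the ttl then differs.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    timestamp: int   # client-supplied write timestamp (USING timestamp)
    ttl: int         # relative ttl in seconds, as supplied by the client
    expires_at: int  # server-side write time + ttl


def write(value, timestamp, ttl, now):
    # expires_at is relative to the server's current time, not the timestamp.
    return Cell(value, timestamp, ttl, now + ttl)


def resolve(left, right):
    # Hypothetical stand-in for the reconciliation step: compare timestamp,
    # value and expires_at (the ttl itself is NOT compared). On a full tie
    # the cell "on the right" wins, so the result depends on input order.
    left_key = (left.timestamp, left.value, left.expires_at)
    right_key = (right.timestamp, right.value, right.expires_at)
    return left if left_key > right_key else right


def digest(cell):
    # Assumed digest contents: includes the relative ttl, as in step 5b.
    payload = repr((cell.value, cell.timestamp, cell.ttl, cell.expires_at))
    return hashlib.sha256(payload.encode()).hexdigest()


sometime = 1_000_000   # the timestamp reused by both writes
t0 = 5_000             # server clock at the first write (seconds)
n = 30                 # seconds between the two writes
ttl1 = 100
ttl2 = ttl1 - n        # ttl reduced by the elapsed time

first = write("v", sometime, ttl1, now=t0)
second = write("v", sometime, ttl2, now=t0 + n)

# Steps 2 and 4: both writes expire at the same instant.
print(first.expires_at == second.expires_at)   # True

# Step 5a: the same two cells, merged in different sstable orders.
a = resolve(first, second)
b = resolve(second, first)
print(a.ttl, b.ttl)                            # 70 100

# Step 5b: the digests disagree because the ttl differs.
print(digest(a) == digest(b))                  # False
```

In this model, also comparing the ttl as a final tie-breaker in `resolve` would make the result independent of input order; whether that matches the actual patch is not stated in this message.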