[jira] [Updated] (CASSANDRA-17251) USING writetime + ttl is non-idempotent leading to non-deterministic merge iteration results

2022-01-10 Thread Jordan West (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan West updated CASSANDRA-17251:

Test and Documentation Plan: Added {{ConflictsTest}}
                     Status: Patch Available  (was: Open)

https://github.com/apache/cassandra/compare/trunk...jrwest:jwest/17251-3.0

> USING writetime + ttl is non-idempotent leading to non-deterministic merge 
> iteration results
> 
>
>                 Key: CASSANDRA-17251
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17251
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Other
>            Reporter: Jordan West
>            Assignee: Jordan West
>            Priority: Normal
>             Fix For: 3.0.x, 3.11.x, 4.0.x
>
>
> The combination of {{USING timestamp = sometime and ttl = ttl}} can produce
> non-deterministic MergeIterator results, causing DigestMismatchExceptions
> and increased latencies. The latency increase comes from the additional
> round trips triggered by the digest mismatch and from read repair rewriting
> the data. The additional writes increase the number of sstables that hold
> the key and that must be scanned on read.
> The order of events is:
> 1. For a given partition, a write is performed with {{USING timestamp =
> sometime and ttl = ttl1}}.
> 2. Cassandra records this write with timestamp = sometime, ttl = ttl1,
> expires_at = now + ttl1.
> 3. After N seconds, another write is performed for the same partition with
> {{USING timestamp = sometime and ttl = ttl2}}, where ttl2 = ttl1 - N. This
> write only makes it to a subset of replicas* for some reason (e.g. partial
> write, node down, etc.).
> 4. Cassandra records this write with timestamp = sometime, ttl = ttl2,
> expires_at = now + ttl2. It's important to note that at this point,
> expires_at from step 2 is equal to expires_at here. This is because
> expires_at is calculated relative to the time the write is performed, not
> the provided timestamp, and the ttl has been reduced by exactly the time
> that passed: (now + N) + (ttl1 - N) = now + ttl1, with now taken at step 2.
> This write also makes it to a subset of replicas*.
> 5. A read of the data is performed.
> 5a. The MergeIterator resolves conflicts locally (across sstables) using
> {{Conflicts.resolveRegular}} or {{Cells.resolveRegular}}. The resolution
> takes into account the write timestamp, the liveness of the cell, the values
> themselves, and how much time is left to live via the expires_at field. In
> this scenario all of these fields are equal, so Cassandra picks the sstable
> "on the right", which is non-deterministic. The only item that differs is
> the ttl itself.
> 5b. One node returns the non-deterministically chosen value for the row,
> while the other two calculate and send a digest to the coordinator. The
> digest includes the relative ttl field, which may not match. This results
> in a DigestMismatchException at the coordinator.
> 6. Read repair is triggered.
> *NOTE: it's not strictly necessary for the write to make it to only a
> subset of replicas. sstables can also be returned from the live set in
> varying order, for reasons such as compaction or repair, which can lead to
> the same behavior. As far as we can tell, this also affects repair.
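
The sequence of events above can be modelled in a few lines. The Python sketch below is illustrative only and is not Cassandra code: the `Cell` class, `write`, `resolve_ignoring_ttl`, `resolve_with_ttl_tiebreak` and `digest` are hypothetical names introduced here, liveness is left out because both cells are live, and the ttl tie-break is just one possible way to make the result order-independent, not necessarily what the linked patch does.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: bytes
    timestamp: int   # client-supplied write timestamp
    ttl: int         # client-supplied ttl, in seconds
    expires_at: int  # server clock at write time + ttl, in seconds


def write(value: bytes, timestamp: int, ttl: int, now: int) -> Cell:
    # Steps 2 and 4: expiry is derived from the server clock ("now"),
    # not from the client-supplied timestamp.
    return Cell(value, timestamp, ttl, now + ttl)


def resolve_ignoring_ttl(left: Cell, right: Cell) -> Cell:
    # Simplified stand-in for step 5a: compare timestamp, expires_at and
    # value. On a complete tie the right-hand cell wins, so the result
    # depends on the order in which the sstables are merged.
    lk = (left.timestamp, left.expires_at, left.value)
    rk = (right.timestamp, right.expires_at, right.value)
    return left if lk > rk else right


def resolve_with_ttl_tiebreak(left: Cell, right: Cell) -> Cell:
    # Assumption: adding ttl as a final tie-breaker makes the outcome
    # independent of merge order. Which ttl should win is a design choice.
    lk = (left.timestamp, left.expires_at, left.value, left.ttl)
    rk = (right.timestamp, right.expires_at, right.value, right.ttl)
    return left if lk > rk else right


def digest(cell: Cell) -> str:
    # Step 5b: the digest covers the ttl, so cells that differ only in
    # ttl hash differently. SHA-256 is used here purely as a stand-in.
    payload = b"|".join(
        [cell.value, str(cell.timestamp).encode(), str(cell.ttl).encode()]
    )
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    now, n = 1_000, 30
    first = write(b"v", timestamp=42, ttl=100, now=now)            # steps 1-2
    second = write(b"v", timestamp=42, ttl=100 - n, now=now + n)   # steps 3-4
    print("same expires_at:", first.expires_at == second.expires_at)

    replica_a = resolve_ignoring_ttl(first, second)  # one sstable order
    replica_b = resolve_ignoring_ttl(second, first)  # the opposite order
    print("digests match:", digest(replica_a) == digest(replica_b))

    fixed_a = resolve_with_ttl_tiebreak(first, second)
    fixed_b = resolve_with_ttl_tiebreak(second, first)
    print("digests match with tie-break:", digest(fixed_a) == digest(fixed_b))
```

In this model the two writes end up with the same expires_at, so the first resolver returns whichever cell happens to be on the right and the two replicas hash different ttl values. With the extra tie-break both merge orders return the same cell.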



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-17251) USING writetime + ttl is non-idempotent leading to non-deterministic merge iteration results

2022-01-10 Thread Jordan West (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan West updated CASSANDRA-17251:

 Bug Category: Parent values: Correctness(12982); Level 1 values: Consistency(12989)
   Complexity: Normal
  Component/s: Local/Other
Discovered By: User Report
     Severity: Normal
       Status: Open  (was: Triage Needed)







[jira] [Updated] (CASSANDRA-17251) USING writetime + ttl is non-idempotent leading to non-deterministic merge iteration results

2022-01-10 Thread Jordan West (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jordan West updated CASSANDRA-17251:

Fix Version/s: 3.0.x
               3.11.x
               4.0.x



