[jira] [Updated] (CASSANDRA-17251) USING writetime + ttl is non-idempotent leading to non-deterministic merge iteration results
[ https://issues.apache.org/jira/browse/CASSANDRA-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jordan West updated CASSANDRA-17251:
    Test and Documentation Plan: Added {{ConflictsTest}}
                         Status: Patch Available  (was: Open)

https://github.com/apache/cassandra/compare/trunk...jrwest:jwest/17251-3.0

> USING writetime + ttl is non-idempotent leading to non-deterministic merge
> iteration results
>
>                 Key: CASSANDRA-17251
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17251
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Other
>            Reporter: Jordan West
>            Assignee: Jordan West
>            Priority: Normal
>             Fix For: 3.0.x, 3.11.x, 4.0.x
>
> The combination of {{USING writetime = timestamp and ttl = ttl}} can result
> in non-deterministic MergeIterator results, causing DigestMismatchExceptions
> and increased latencies. The increased latencies are caused by additional
> round trips due to the digest mismatch, as well as by read repair rewriting
> the data. The additional writes increase the number of sstables the key is
> stored in, all of which must be scanned on read.
>
> The order of events is:
>
> 1. For a given partition, a write is performed with {{USING timestamp =
> sometime and ttl = ttl1}}.
> 2. Cassandra records this write with timestamp = sometime, ttl = ttl1,
> expires_at = now + ttl1.
> 3. After N seconds, for the same partition, another write is performed with
> {{USING timestamp = sometime and ttl = ttl2}}, where ttl2 = ttl1 - N. This
> write only makes it to a subset of replicas* for some reason (e.g. partial
> write, node down, etc.).
> 4. Cassandra records this write with timestamp = sometime, ttl = ttl2,
> expires_at = now + ttl2. It's important to note that at this point,
> expires_at in step 2 above is equal to expires_at here. This is because it is
> calculated relative to the current write time, not the provided timestamp,
> and the ttl has been adjusted by the time passed. This write also makes it
> to a subset of replicas*.
> 5. A read of the data is performed.
> 5a. The MergeIterator resolves conflicts locally (across sstables) using
> {{Conflicts.resolveRegular}} or {{Cells.resolveRegular}}. The resolution
> takes into account the write timestamp, the liveness of the cell, the values
> themselves, and how much time is left to live via the expires_at field. In
> this scenario, all of these fields are equal, leading to Cassandra picking
> the sstable "on the right"; this is non-deterministic. The only item that
> differs is the ttl itself.
> 5b. One node returns the non-deterministically chosen value for the row; the
> other two calculate and send a digest to the coordinator. The digest includes
> the relative ttl field, which may not match. This results in a
> DigestMismatchException at the coordinator.
> 6. Read repair is triggered.
>
> *NOTE: it's not strictly necessary for the write to make it to a subset of
> replicas. sstables can also be returned from the live set in random order,
> for reasons like compaction or repair, which can lead to the same behavior.
> This also affects repair from what we can tell.

--
This message was sent by Atlassian Jira (v8.20.1#820001)

To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
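
The tie described in steps 2 to 5b of the issue can be illustrated with a small Python sketch. It is a simplified model, not Cassandra's code: the `Cell` class and the `write`, `resolve` and `digest` helpers are hypothetical names introduced here, the liveness check is omitted because both cells are live, and SHA-256 merely stands in for whatever digest the replicas compute. The only behaviors taken from the issue are that resolution compares timestamp, value and expires_at, that a full tie falls through to the cell "on the right", and that the digest includes the relative ttl.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    timestamp: int   # client-supplied write timestamp (USING timestamp)
    ttl: int         # relative ttl in seconds, as supplied by the client
    expires_at: int  # absolute expiry: server-side "now" at write time + ttl
    value: bytes


def write(timestamp: int, ttl: int, value: bytes, now: int) -> Cell:
    """Record a write; expires_at is relative to 'now', not to the timestamp."""
    return Cell(timestamp, ttl, now + ttl, value)


def resolve(left: Cell, right: Cell) -> Cell:
    """Simplified stand-in for the local conflict resolution in step 5a."""
    if left.timestamp != right.timestamp:
        return left if left.timestamp > right.timestamp else right
    if left.value != right.value:
        return left if left.value > right.value else right
    if left.expires_at != right.expires_at:
        return left if left.expires_at > right.expires_at else right
    # Every compared field is equal: the winner depends only on merge order.
    return right


def digest(cell: Cell) -> str:
    """Hypothetical digest that, as in step 5b, includes the relative ttl."""
    payload = f"{cell.timestamp}:{cell.ttl}:{cell.value.hex()}".encode()
    return hashlib.sha256(payload).hexdigest()


SOMETIME = 1_000  # the shared client timestamp
N = 10            # seconds between the two writes

first = write(SOMETIME, ttl=100, value=b"v", now=5_000)
second = write(SOMETIME, ttl=100 - N, value=b"v", now=5_000 + N)
assert first.expires_at == second.expires_at  # the coincidence from step 4

# Two replicas (or two sstable orderings) merge the same cells in opposite order.
replica_a = resolve(first, second)
replica_b = resolve(second, first)

print(replica_a.ttl, replica_b.ttl)            # 90 100
print(digest(replica_a) == digest(replica_b))  # False
```

In this model, both merges agree on the value and the expiry, yet they return different ttl fields, so the digests differ even though the data is logically the same.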
[jira] [Updated] (CASSANDRA-17251) USING writetime + ttl is non-idempotent leading to non-deterministic merge iteration results
[ https://issues.apache.org/jira/browse/CASSANDRA-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jordan West updated CASSANDRA-17251:
     Bug Category: Parent values: Correctness(12982)Level 1 values: Consistency(12989)
       Complexity: Normal
      Component/s: Local/Other
    Discovered By: User Report
         Severity: Normal
           Status: Open  (was: Triage Needed)
[jira] [Updated] (CASSANDRA-17251) USING writetime + ttl is non-idempotent leading to non-deterministic merge iteration results
[ https://issues.apache.org/jira/browse/CASSANDRA-17251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jordan West updated CASSANDRA-17251:
    Fix Version/s: 3.0.x
                   3.11.x
                   4.0.x