[
https://issues.apache.org/jira/browse/CASSANDRA-16710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17357473#comment-17357473
]
Benjamin Lerer commented on CASSANDRA-16710:
--------------------------------------------
If I am not mistaken, even after CASSANDRA-10657, C* is still fetching all the
regular columns for normal queries (this is required to ensure the CQL semantics
around empty rows), so the data should be there. It is simply that read repair
will only repair the queried columns.
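
For context, those empty-row semantics are visible from CQL: selecting a subset
of columns must still return a row (with nulls) when the row exists but those
columns are unset. A minimal sketch, assuming the rrtest keyspace from the repro
below and a throwaway table name ({{kv_empty}} is illustrative only):

{code:bash}
(cat <<EOF
CREATE TABLE IF NOT EXISTS rrtest.kv_empty (key TEXT PRIMARY KEY, col1 TEXT, col2 INT);
-- a row that exists but has no regular column values
INSERT INTO rrtest.kv_empty (key) VALUES ('only-key');
-- must return one row with col1 = null, not an empty result
SELECT col1 FROM rrtest.kv_empty WHERE key = 'only-key';
EOF
) | ccm node1 cqlsh
{code}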
> Read repairs can break row isolation
> ------------------------------------
>
> Key: CASSANDRA-16710
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16710
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Coordination
> Reporter: Samuel Klock
> Priority: Urgent
> Fix For: 3.0.x, 3.11.x, 4.0.x
>
>
> This issue essentially revives CASSANDRA-8287, which was resolved "Later" in 2015.
> While it was possible in principle at that time for read repair to break row
> isolation, that couldn't happen in practice because Cassandra always pulled
> all of the columns for each row in response to regular reads, so read repairs
> would never partially resolve a row. CASSANDRA-10657 modified Cassandra to
> only pull the requested columns for reads, which enabled read repair to break
> row isolation in practice.
> Note also that this is distinct from CASSANDRA-14593 (for read repair
> breaking partition-level isolation): that issue (as we understand it)
> captures isolation being broken across multiple rows within an update to a
> partition, while this issue covers broken isolation across multiple columns
> within an update to a single row.
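> To make the distinction concrete, here is a sketch using a hypothetical table
> with a clustering column; the {{rrtest.events}} table and the statements below
> are illustrative only and are not part of the repro that follows.
> {code:bash}
> (cat <<EOF
> -- hypothetical table: one partition ('key') can hold several rows
> CREATE TABLE IF NOT EXISTS rrtest.events (
>     key TEXT, seq INT, col1 TEXT, col2 INT,
>     PRIMARY KEY (key, seq));
> -- CASSANDRA-14593 territory: isolation across the two rows written by
> -- this single-partition batch
> BEGIN BATCH
>     INSERT INTO rrtest.events (key, seq, col1, col2) VALUES ('key', 1, 'a', 1);
>     INSERT INTO rrtest.events (key, seq, col1, col2) VALUES ('key', 2, 'a', 1);
> APPLY BATCH;
> -- this issue: isolation across col1 and col2 within a single row
> UPDATE rrtest.events SET col1 = 'b', col2 = 2 WHERE key = 'key' AND seq = 1;
> EOF
> ) | ccm node1 cqlsh
> {code}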
> This behavior is easy to reproduce under affected versions using {{ccm}}:
> {code:bash}
> ccm create -n 3 -v $VERSION rrtest
> ccm updateconf -y 'hinted_handoff_enabled: false
> max_hint_window_in_ms: 0'
> ccm start
> (cat <<EOF
> CREATE KEYSPACE IF NOT EXISTS rrtest WITH REPLICATION = {'class':
> 'SimpleStrategy', 'replication_factor': '3'};
> CREATE TABLE IF NOT EXISTS rrtest.kv (key TEXT PRIMARY KEY, col1 TEXT, col2
> INT);
> CONSISTENCY ALL;
> INSERT INTO rrtest.kv (key, col1, col2) VALUES ('key', 'a', 1);
> EOF
> ) | ccm node1 cqlsh
> ccm node3 stop
> (cat <<EOF
> CONSISTENCY QUORUM;
> INSERT INTO rrtest.kv (key, col1, col2) VALUES ('key', 'b', 2);
> EOF
> ) | ccm node1 cqlsh
> ccm node3 start
> ccm node2 stop
> (cat <<EOF
> CONSISTENCY QUORUM;
> SELECT key, col1 FROM rrtest.kv WHERE key = 'key';
> EOF
> ) | ccm node1 cqlsh
> ccm node1 stop
> (cat <<EOF
> CONSISTENCY ONE;
> SELECT * FROM rrtest.kv WHERE key = 'key';
> EOF
> ) | ccm node3 cqlsh
> {code}
> This snippet creates a three-node cluster with an RF=3 keyspace containing a
> table with three columns: a partition key and two value columns. (Hinted
> handoff can mask the problem if the repro steps are executed in quick
> succession, so the snippet disables it for this exercise.) Then:
> # It adds a full row to the table with values ('a', 1), ensuring it's
> replicated to all three nodes.
> # It stops a node, then replaces the initial row with new values ('b', 2) in
> a single update, ensuring that it's replicated to both available nodes.
> # It starts the node that was down, then stops one of the other nodes and
> performs a quorum read of {{col1}} only. The read observes 'b'.
> # Finally, it stops the other node that observed the second update, then
> performs a CL=ONE read of the entire row on the node that was down for that
> update.
> If read repair respects row isolation, then the final read should observe
> ('b', 2). (('a', 1) is also acceptable if we're willing to sacrifice
> monotonicity.)
> * With {{VERSION=3.0.24}}, the final read observes ('b', 2), as expected.
> * With {{VERSION=3.11.10}} and {{VERSION=4.0-rc1}}, the final read instead
> observes ('b', 1). The same is true for 3.0.24 if CASSANDRA-10657 is
> backported to it.
> The scenario above is somewhat contrived in that it supposes multiple read
> workflows consulting different sets of columns with different consistency
> levels. Under 3.11, asynchronous read repair makes this scenario possible
> even using just CL=ONE -- and with speculative retry, even if
> {{read_repair_chance}}/{{dclocal_read_repair_chance}} are both zeroed. We
> haven't looked closely at 4.0, but even though (as we understand it) it lacks
> async read repair, scenarios like CL=ONE writes or failed,
> partially-committed CL>ONE writes create some surface area for this behavior,
> even without mixed consistency/column reads.
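> For completeness, a minimal sketch of zeroing those per-table options under
> 3.11 (they no longer exist in 4.0), using the repro's {{rrtest.kv}} table; as
> noted above, speculative retry can still leave the scenario reachable:
> {code:bash}
> (cat <<EOF
> -- 3.11 table options; zeroing them does not rule out the scenario
> -- above, because speculative retry can still trigger a read repair
> ALTER TABLE rrtest.kv
>   WITH read_repair_chance = 0.0
>    AND dclocal_read_repair_chance = 0.0;
> EOF
> ) | ccm node1 cqlsh
> {code}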
> Given the importance of paging to reads from wide partitions, it makes some
> intuitive sense that applications shouldn't rely on isolation at the
> partition level. Being unable to rely on row isolation is much more
> surprising, especially given that (modulo the possibility of other atomicity
> bugs) Cassandra did preserve it before 3.11. Cassandra should either find a
> solution for this in code (e.g., when performing a read repair, always
> operate over all of the columns for the table, regardless of what was
> originally requested for a read) or at least update its documentation to
> include appropriate caveats about update isolation.
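> In the meantime, one application-side mitigation that appears to follow from
> the behavior above (repair covering exactly the queried columns) is to read
> every column whenever row isolation matters, so that any triggered read repair
> writes back the whole row. A minimal sketch against the repro table:
> {code:bash}
> (cat <<EOF
> CONSISTENCY QUORUM;
> -- select all columns even if only col1 is needed, so that a
> -- triggered read repair covers the entire row
> SELECT * FROM rrtest.kv WHERE key = 'key';
> EOF
> ) | ccm node1 cqlsh
> {code}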
--
This message was sent by Atlassian Jira
(v8.3.4#803005)