[
https://issues.apache.org/jira/browse/CASSANDRA-16710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17358469#comment-17358469
]
Benjamin Lerer edited comment on CASSANDRA-16710 at 6/7/21, 11:43 AM:
----------------------------------------------------------------------
{quote} IIUC we can’t have row isolation, queries not fetching all the columns
and read repair at the same time.{quote}
Due to the CQL semantics, C* needs to be able to distinguish between having a
row (with potentially only null values for the non-primary-key columns) and
not having a row at all. To do that, C* needs to fetch a different set of
columns than the one actually queried.
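As a minimal illustration (using a hypothetical table {{t}} introduced for this example only):
{code:sql}
CREATE TABLE t (key TEXT PRIMARY KEY, col1 TEXT, col2 INT);
INSERT INTO t (key) VALUES ('k');        -- row exists; col1 and col2 are null
SELECT col2 FROM t WHERE key = 'k';      -- returns one row with a null col2
SELECT col2 FROM t WHERE key = 'other';  -- returns no row at all
-- To tell these two results apart, C* must fetch more than just the queried
-- col2 cell (e.g. the row's liveness information).
{code}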
If my understanding of CASSANDRA-10657 is correct, the patch made 2 major
changes:
# it optimized the amount of data being fetched by *skipping the cells that
were not queried and had a timestamp lower than the one of the row*
# it ensured that *only the queried columns* were being repaired by read-repair
Point 1 means that for some columns not to be fetched, they have to be older
than the latest {{INSERT}} (as only {{INSERT}} statements update the primary
key timestamp). As a consequence, to break row isolation at the *fetched
column level*, we need a scenario where some of the queried columns belong to
the same mutation as the ones being skipped.
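A hypothetical timeline where that happens, borrowing the {{rrtest.kv}} table from the repro in the description below (timestamps are illustrative):
{code:sql}
INSERT INTO rrtest.kv (key, col1, col2) VALUES ('key', 'a', 1); -- ts=1, row timestamp = 1
UPDATE rrtest.kv SET col1 = 'b', col2 = 2 WHERE key = 'key';    -- ts=2, one mutation; row timestamp unchanged
INSERT INTO rrtest.kv (key) VALUES ('key');                     -- ts=3, row timestamp bumped to 3
-- A query for col1 fetches col1 ('b', ts=2) but may skip col2 (2, ts=2 < row ts=3):
SELECT key, col1 FROM rrtest.kv WHERE key = 'key';
-- A read repair driven by that read would propagate col1 without col2,
-- splitting the single ts=2 mutation across replicas.
{code}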
It seems to me that we could avoid that problem by skipping only the columns
that have a timestamp lower than the lowest timestamp of the queried columns
(if we have at least one queried column with a timestamp lower than the row
one). Once the fetched columns are guaranteed to preserve row isolation, the
read-repair logic should be changed to repair all the fetched columns.
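Per-cell timestamps for the timeline above can be inspected with {{writetime()}}; a sketch of how the proposed rule would apply:
{code:sql}
SELECT col1, writetime(col1) AS ts1, col2, writetime(col2) AS ts2
FROM rrtest.kv WHERE key = 'key';
-- If only col1 (ts=2) is queried, the lowest queried-cell timestamp is 2,
-- so under the proposed rule only cells with ts < 2 may be skipped:
-- col2 (ts=2) is still fetched, and the ts=2 mutation is repaired as a whole.
{code}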
> Read repairs can break row isolation
> ------------------------------------
>
> Key: CASSANDRA-16710
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16710
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Coordination
> Reporter: Samuel Klock
> Assignee: Benjamin Lerer
> Priority: Urgent
> Fix For: 3.0.x, 3.11.x, 4.0.x
>
>
> This issue essentially revives CASSANDRA-8287, which was resolved "Later" in 2015.
> While it was possible in principle at that time for read repair to break row
> isolation, that couldn't happen in practice because Cassandra always pulled
> all of the columns for each row in response to regular reads, so read repairs
> would never partially resolve a row. CASSANDRA-10657 modified Cassandra to
> only pull the requested columns for reads, which enabled read repair to break
> row isolation in practice.
> Note also that this is distinct from CASSANDRA-14593 (for read repair
> breaking partition-level isolation): that issue (as we understand it)
> captures isolation being broken across multiple rows within an update to a
> partition, while this issue covers broken isolation across multiple columns
> within an update to a single row.
> This behavior is easy to reproduce under affected versions using {{ccm}}:
> {code:bash}
> ccm create -n 3 -v $VERSION rrtest
> ccm updateconf -y 'hinted_handoff_enabled: false
> max_hint_window_in_ms: 0'
> ccm start
> (cat <<EOF
> CREATE KEYSPACE IF NOT EXISTS rrtest WITH REPLICATION = {'class':
> 'SimpleStrategy', 'replication_factor': '3'};
> CREATE TABLE IF NOT EXISTS rrtest.kv (key TEXT PRIMARY KEY, col1 TEXT, col2
> INT);
> CONSISTENCY ALL;
> INSERT INTO rrtest.kv (key, col1, col2) VALUES ('key', 'a', 1);
> EOF
> ) | ccm node1 cqlsh
> ccm node3 stop
> (cat <<EOF
> CONSISTENCY QUORUM;
> INSERT INTO rrtest.kv (key, col1, col2) VALUES ('key', 'b', 2);
> EOF
> ) | ccm node1 cqlsh
> ccm node3 start
> ccm node2 stop
> (cat <<EOF
> CONSISTENCY QUORUM;
> SELECT key, col1 FROM rrtest.kv WHERE key = 'key';
> EOF
> ) | ccm node1 cqlsh
> ccm node1 stop
> (cat <<EOF
> CONSISTENCY ONE;
> SELECT * FROM rrtest.kv WHERE key = 'key';
> EOF
> ) | ccm node3 cqlsh
> {code}
> This snippet creates a three-node cluster with an RF=3 keyspace containing a
> table with three columns: a partition key and two value columns. (Hinted
> handoff can mask the problem if the repro steps are executed in quick
> succession, so the snippet disables it for this exercise.) Then:
> # It adds a full row to the table with values ('a', 1), ensuring it's
> replicated to all three nodes.
> # It stops a node, then replaces the initial row with new values ('b', 2) in
> a single update, ensuring that it's replicated to both available nodes.
> # It starts the node that was down, then stops one of the other nodes and
> performs a quorum read of just the {{col1}} column. The read observes 'b'.
> # Finally, it stops the other node that observed the second update, then
> performs a CL=ONE read of the entire row on the node that was down for that
> update.
> If read repair respects row isolation, then the final read should observe
> ('b', 2). (('a', 1) is also acceptable if we're willing to sacrifice
> monotonicity.)
> * With {{VERSION=3.0.24}}, the final read observes ('b', 2), as expected.
> * With {{VERSION=3.11.10}} and {{VERSION=4.0-rc1}}, the final read instead
> observes ('b', 1). The same is true for 3.0.24 if CASSANDRA-10657 is
> backported to it.
> The scenario above is somewhat contrived in that it supposes multiple read
> workflows consulting different sets of columns with different consistency
> levels. Under 3.11, asynchronous read repair makes this scenario possible
> even using just CL=ONE -- and with speculative retry, even if
> {{read_repair_chance}}/{{dclocal_read_repair_chance}} are both zeroed. We
> haven't looked closely at 4.0, but even though (as we understand it) it lacks
> async read repair, scenarios like CL=ONE writes or failed,
> partially-committed CL>ONE writes create some surface area for this behavior,
> even without mixed consistency/column reads.
> Given the importance of paging to reads from wide partitions, it makes some
> intuitive sense that applications shouldn't rely on isolation at the
> partition level. Being unable to rely on row isolation is much more
> surprising, especially given that (modulo the possibility of other atomicity
> bugs) Cassandra did preserve it before 3.11. Cassandra should either find a
> solution for this in code (e.g., when performing a read repair, always
> operate over all of the columns for the table, regardless of what was
> originally requested for a read) or at least update its documentation to
> include appropriate caveats about update isolation.