I just realised my email client hid a lot of your email, so I must have 
misunderstood your statement – I see now you must have meant per-statement 
snapshot isolation. However, I believe that MVCC is an optimisation for such 
an isolation level, not a requirement: it is possible (as Calvin does, and 
as distributed consensus protocols do) to simply serialize execution, unless 
I’m missing something.
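
To make that claim concrete, here is a minimal sketch (all names invented, 
and nothing to do with Calvin's actual machinery) of serialized execution 
providing per-statement snapshots over a single-version store: because no 
writes interleave with a running statement, each statement observes a 
consistent point-in-time view without the engine retaining multiple versions.

    import java.util.HashMap;
    import java.util.Map;

    final class SerializedExecutor {
        private final Map<String, String> store = new HashMap<>();

        interface Statement {
            String run(Map<String, String> singleVersionStore);
        }

        // Every statement from every transaction funnels through this one
        // lock, so a statement can never observe another statement's
        // partial effects: a snapshot, with only one version of each key.
        public synchronized String execute(Statement stmt) {
            return stmt.run(store);
        }
    }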

Otherwise I agree entirely with your email, now I’ve read it 😊


From: bened...@apache.org <bened...@apache.org>
Date: Wednesday, 13 October 2021 at 08:52
To: dev@cassandra.apache.org <dev@cassandra.apache.org>
Subject: Re: Tradeoffs for Cassandra transaction management
Hi Alex,

I hugely value and respect your input here, but I think in this case you may be 
mistaken.

Postgres[1] makes explicit that subsequent SELECT statements may see different 
data, and SQL Server[2] does the same. I believe the Oracle documents you 
reference say the same, but are more opaque: they state that read committed 
is a statement-level isolation level, i.e. “[for the] read committed 
isolation level, this point is the time at which the statement was opened”, 
though for read-only transactions they upgrade this to transaction-level 
isolation. Indeed, the ANSI SQL document you reference also supports this 
meaning: “Non-repeatable read” is defined only so that it can be used later, 
in the table on page 68, which defines READ COMMITTED as permitting these 
phenomena to occur.

Accord offers READ COMMITTED out of the box, essentially (modulo 
read-your-writes).

[1] https://www.postgresql.org/docs/7.2/xact-read-committed.html
[2] 
https://docs.microsoft.com/en-us/sql/connect/jdbc/understanding-isolation-levels?view=sql-server-ver15
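
For anyone who wants to check this behaviour themselves, a minimal JDBC 
sketch (connection details are placeholders, and table t is assumed to 
exist) against Postgres or SQL Server will show two SELECTs in one READ 
COMMITTED transaction observing different data whenever another session 
commits in between:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ReadCommittedDemo {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/test", "user", "pass")) {
                conn.setTransactionIsolation(
                        Connection.TRANSACTION_READ_COMMITTED);
                conn.setAutoCommit(false);
                try (Statement stmt = conn.createStatement()) {
                    long first = count(stmt);
                    // If another session commits an INSERT into t here,
                    // READ COMMITTED permits the second read to see it.
                    long second = count(stmt);
                    System.out.println(first + " vs " + second); // may differ
                }
                conn.rollback();
            }
        }

        private static long count(Statement stmt) throws Exception {
            try (ResultSet rs = stmt.executeQuery("SELECT count(*) FROM t")) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }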


From: Alex Miller <millerde...@gmail.com>
Date: Wednesday, 13 October 2021 at 08:07
To: dev@cassandra.apache.org <dev@cassandra.apache.org>
Subject: Re: Tradeoffs for Cassandra transaction management
On Tue, Oct 12, 2021 at 3:55 PM Henrik Ingo <henrik.i...@datastax.com> wrote:
> We define READ COMMITTED as "whatever is returned by Cassandra when
> executing the query (with QUORUM consistency)". In other words, this
> functionality doesn't require any changes to the storage engine or other
> fundamental changes to Cassandra. The Accord commit is guaranteed to
> succeed per design and the READ COMMITTED transaction doesn't add any
> additional checks for conflicts. As such, this functionality remains
> abort-free.
>     [snip]
> Future work: A motivation for the above proposal is that the same scheme
> could be extended to support SNAPSHOT ISOLATION transactions. This would
> require MVCC support from the storage engine.

These two pieces together seem to imply that your claim is that Read
Committed may read whatever data was most recently committed during the
execution of the statement, and so does not require MVCC.  Though I
agree that the standard[1] is very unclear as to what a "read" means
when defining a non-repeatable read:

>  2) P2 ("Non-repeatable read"): SQL-transaction T1 reads a row. SQL-
>           transaction T2 then modifies or deletes that row and performs
>           a COMMIT. If T1 then attempts to reread the row, it may receive
>           the modified value or discover that the row has been deleted.

The common implementation is that Read Committed reads from a snapshot
of the database state.  The documentation of various database
implementations is much clearer about this.  See, for example,
Oracle[2] or MySQL[3] on the subject.

So I believe Read Committed would also require MVCC support in the
storage engine the same way that Snapshot Isolation would.

[1]: https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
[2]: https://docs.oracle.com/cd/E25054_01/server.1111/e25789/consist.htm
     (see section "Read Consistency in the Read Committed Isolation Level")
[3]: 
https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html
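
To illustrate the dependency (a sketch with invented names, not any actual
engine's layout): serving every read of a statement at one fixed timestamp
means the engine must retain older committed versions of a key until no
open statement can still read them, which is exactly MVCC.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    final class VersionedStore {
        // key -> (commit timestamp -> value); older versions are retained
        // so that an in-flight statement can still read them.
        private final Map<String, TreeMap<Long, String>> data = new HashMap<>();

        public synchronized void commit(String key, long commitTs, String value) {
            data.computeIfAbsent(key, k -> new TreeMap<>()).put(commitTs, value);
        }

        // Every read within one statement uses the same readTs, so the
        // statement sees a consistent snapshot even while commits land.
        public synchronized String readAt(String key, long readTs) {
            TreeMap<Long, String> versions = data.get(key);
            if (versions == null) return null;
            Map.Entry<Long, String> e = versions.floorEntry(readTs);
            return e == null ? null : e.getValue();
        }
    }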


On Tue, Oct 12, 2021 at 3:55 PM Henrik Ingo <henrik.i...@datastax.com> wrote:
> Approach: The conversational part of the transaction is a sequence of
> regular Cassandra reads and writes. Mutations are however executed as
> read-queries toward the database nodes. Database state isn't modified
> during the conversational phase, rather the primary keys of the
> to-be-mutated rows are stored for later use. Accord is essentially the
> commit phase of the transaction. All primary keys to be updated are the
> write set of the Accord transaction. There's no need to re-execute the
> reads, so the read set is empty.

As I've pondered this over time, I've come to specifically fault
read-your-uncommitted-writes as the reason why NewSQL databases are
essentially a design monoculture.  Every such database persists uncommitted
writes in the database itself during execution.  Doing so encourages
those writes to be re-used for concurrency control (i.e. write
intents), and that in turn places you in the exact client-driven 3PC
protocol which Cockroach, TiDB, and YugaByte all implement.  Even if
you find some radically different SQL database, like leanXcale, it
_still_ persists uncommitted writes into the database.

And every time I've thought this through, I tend to agree.  It's
exceedingly easy to write a SQL query which will exceed any local
limit imposed by memory, and it's too easy to write a query which runs
fine in production for a while, until it hits a memory limit and
immediately begins to fail.  There's a tremendous implementation
difference between `DELETE FROM Table` and `TRUNCATE Table`, and it's
relatively hard to explain that to naive users.
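
Under the proposal's own terms (a sketch, with invented names), the gap is
easy to see: the conversational phase stores the primary keys of all
to-be-mutated rows, so the client-side cost of `DELETE FROM Table` is
proportional to the table's size, while `TRUNCATE Table` stays a
constant-size metadata operation.

    import java.util.ArrayList;
    import java.util.List;

    final class ConversationalDelete {
        // Primary keys of to-be-mutated rows, held until the Accord commit.
        private final List<byte[]> writeSet = new ArrayList<>();

        // DELETE executes as a read; every matching key must be buffered.
        void stage(Iterable<byte[]> matchingPrimaryKeys) {
            for (byte[] pk : matchingPrimaryKeys) {
                writeSet.add(pk);  // unbounded for a full-table DELETE:
            }                      // eventually exceeds any memory limit
        }
    }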

Memory constraints aside, merging a local write cache into remote data
for execution seems like it'd be quite a task.  Any desire for
efficient distributed query execution would push for a design where
query fragments can be pushed down to the nodes holding the data.  I
imagine that one would then need to distribute all writes out to each
partition along with the query fragment for them to execute, so that
they can merge the pending writes in with the existing data, but such
a solution also places a significant overhead burden on the database:
clients need to resend all potentially relevant writes to servers on
each statement, and servers need to hold all of a client’s writes in
memory as they execute.  The alternative – trying to carefully calculate
the subset of the result affected by the local writes and union it into
the query executed without the local writes – feels prohibitively
complex.  Both directions seem fraught with peril.
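
The merge itself is only simple in the trivial single-node case; here is a
sketch (invented names, sorted maps standing in for row streams) of what
each server would have to do with a client's resent write buffer:

    import java.util.Map;
    import java.util.NavigableMap;
    import java.util.TreeMap;

    final class WriteOverlay {
        // Overlays a client's pending writes onto committed server rows;
        // a null value models a pending delete.  Each statement would have
        // to ship localWrites to every participating server.
        static NavigableMap<String, String> overlay(
                NavigableMap<String, String> serverRows,
                NavigableMap<String, String> localWrites) {
            NavigableMap<String, String> merged = new TreeMap<>(serverRows);
            for (Map.Entry<String, String> w : localWrites.entrySet()) {
                if (w.getValue() == null) merged.remove(w.getKey());
                else merged.put(w.getKey(), w.getValue());
            }
            return merged;
        }
    }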

But the write intent approach makes this conveniently easy, as any
server that has a row of data also has all the uncommitted versions of
that row from currently in-progress transactions, and thus can easily
filter to the correct row as part of its MVCC implementation.  Faced
with these two options, I understand why the world has decided that
write intents are a great solution, but the monoculture does make me a
bit sad.
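
Loosely, the layout that makes it easy looks like this (a sketch with
invented names; real systems also have to resolve a foreign intent by
blocking on or aborting its owner, which is omitted here):

    import java.util.Map;
    import java.util.TreeMap;

    final class Row {
        private final TreeMap<Long, String> committed = new TreeMap<>();
        private String intentValue;  // at most one uncommitted write
        private long intentTxnId;    // the transaction that owns it

        synchronized String readAt(long readTs, long readerTxnId) {
            // A transaction sees its own intent (read-your-writes);
            // everyone else falls back to the latest committed version
            // at or below their read timestamp.
            if (intentValue != null && intentTxnId == readerTxnId) {
                return intentValue;
            }
            Map.Entry<Long, String> e = committed.floorEntry(readTs);
            return e == null ? null : e.getValue();
        }
    }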


On Tue, Oct 12, 2021 at 5:20 PM Jonathan Ellis <jbel...@gmail.com> wrote:
> without having the entire logic of the transaction available to it, the server
> cannot retry the transaction when concurrent changes are found to have
> been applied after the reconnaissance reads (what you call the
> conversational phase).

This is not true at Read Committed!  The major advantage of Read
Committed over Repeatable Read is that, by giving up a consistent read
timestamp from statement to statement, you give the server the ability
to retry your statement's execution multiple times, and to wait out
conflicting transactions between each attempt.  Entire transactions
don't need to be retried when a conflict is encountered, only the
statement which encountered the conflict.  This is largely pedantic
nitpicking, as your question applies at Repeatable Read and above, but
server-side retry is an important (and expected) Read Committed
optimization!
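
In sketch form (all names invented): because the read timestamp is chosen
per statement, a conflict only forces the server to pick a fresh timestamp
and re-run that one statement, invisibly to the client.

    final class StatementRetry {
        static final class ConflictException extends Exception {}

        interface Clock { long latestCommitted(); }
        interface Statement<T> { T runAt(long readTs) throws ConflictException; }

        static <T> T run(Clock clock, Statement<T> stmt)
                throws InterruptedException {
            while (true) {
                long readTs = clock.latestCommitted();  // fresh per attempt
                try {
                    return stmt.runAt(readTs);
                } catch (ConflictException e) {
                    Thread.sleep(10);  // wait out the conflicting txn, then
                }                      // retry the statement, not the txn
            }
        }
    }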

But more related to your point, suppose an (Accord) transaction
attempts to re-validate its reconnaissance reads, fails, and aborts.
The client that submitted the transaction then notices that its
transaction failed, re-runs reconnaissance, and re-submits the
transaction.  It does so in a loop until one of its transactions
reports successfully executing/applying.  Do you consider that an
interactive transaction?  If so, then I would wish to clarify that
this isn't a question of _whether_ a deterministic database can
support interactive transactions, but of how efficiently it can
support them.
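
The loop in question, as a sketch (types and method names are invented
stand-ins, not Accord's API):

    final class ReconLoop {
        interface Query {}
        interface ReconResult {}
        interface WriteSet {}

        interface Store {
            ReconResult reconnaissanceReads(Query q);  // conversational phase
            // Submits an Accord txn that re-validates the recon reads and,
            // if they still hold, applies the writes; false means aborted.
            boolean submitAccordTxn(ReconResult reads, WriteSet writes);
        }

        static void runUntilApplied(Store store, Query q, WriteSet writes) {
            while (true) {
                ReconResult reads = store.reconnaissanceReads(q);
                if (store.submitAccordTxn(reads, writes)) {
                    return;  // executed/applied successfully
                }
                // Aborted: concurrent changes invalidated the recon reads;
                // loop and try again with fresh reads.
            }
        }
    }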
