To lead with the summary, there are two main points below.

First, unsurprisingly, Claude generated incomplete and buggy code for
implementing SQL. Great to see we all haven't been completely replaced
by LLMs yet. Pointing out some of what's missing or done wrong is
hopefully useful for discussing the gap between what an existing
postgres OLTP application needs and what the example code shows.

Second, layering SQL on top of a key-value database, even
a transactional one, will be slow. Layering on CQL is a bit nicer in
that it makes certain forms of pushdown optimization easy, but it
doesn't fundamentally change that a firm boundary between SQL and the
rest of the system leads to worse performance. Implementing SQL with the
goal of being competitive with existing NewSQL solutions on performance
will lead to a different set of design requirements than just offering
SQL to users as a convenience, as a more powerful CQL. Percolator runs
fine on just a linearizable key-value store, and Accord allows some nice
simplifications and optimizations, but it doesn't change what's feasible
for implementing SQL, just its performance.

I had also, coincidentally, been incrementally chipping away at a blog
post on the subject of SQL on key-value stores: the designs, tradeoffs,
and constraints, and secretly the overarching point that a strict
layering is terrible for performance. Some of the structure and content
here is borrowed from that draft, but in an effort to not _entirely_ be
my verbose self, you're getting the rough outline of some of those
thoughts on designs intermixed with points on Claude's specific
curious decisions. Happy to hammer out the fuller version of any
sub-topic if there's interest or questions.

### SQL Isolation Level

There are three popular choices of isolation level: Serializable
(CockroachDB), Repeatable Read aka Snapshot Isolation (TiDB & Yugabyte),
and Read Committed, which is actually two different isolation levels:
ANSI Read Committed, which the standard specifies but nearly no one
implements, and Read Committed Snapshot Isolation[1], which nearly every
database implements as Read Committed.

What Claude implemented describes itself as Serializable, is implemented
in the style of Snapshot Isolation, but actually only offers some
half-broken form of ANSI Read Committed. Specifically, `commitTs == null`
indicates _either_ a key that is uncommitted, _or_ a key that has been
committed but hasn't been asynchronously updated post-commit yet. As far
as I can tell, KvStore.get()[2] seems to reject both cases, and thus it
should only yield read values once the async post-commit process has
rewritten them. That looks like best-effort transaction atomicity. (If
anyone is wondering why TPC-C would run fine, TPC-C doesn't do any
correctness validation.)

Read Committed Snapshot Isolation is a very popular choice in OLTP
applications[3]. It's the strongest isolation level that still permits
the server to retry statements upon conflicts without aborting the
entire transaction, which means clients don't really need transaction
retry loops. Even Cockroach, which took the approach of Serializability
By Default, was later forced to go back and offer Read Committed to help
users port existing postgres applications. I specifically suggest
against trying to offer serializability: Cockroach can only do it
because they sacrifice abort-free read-only transactions. The standard
percolator / client-driven three-phase commit algorithm can't do better
than snapshot isolation.
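
To make the retry point concrete, here's a postgres-flavored
illustration (the `accounts` table is made up):

```sql
-- Session A:
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;

-- Session B (blocks on A's row lock):
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;

-- Session A:
COMMIT;

-- Under Read Committed, session B's UPDATE now re-reads the newly
-- committed row and proceeds; no error ever reaches the client.
-- Under Repeatable Read / Serializable, session B instead gets
-- "ERROR: could not serialize access due to concurrent update" and the
-- client has to retry the whole transaction itself.
```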

[1]: https://sqlperformance.com/2014/05/t-sql-queries/read-committed-snapshot-isolation
[2]: https://github.com/geico/cassandra-sql/blob/a0257ec9a22ff84daaf6f529ae8b523fdc45b431/src/main/java/com/geico/poc/cassandrasql/kv/KvStore.java#L199
[3]: https://www.cs.cmu.edu/~pavlo/slides/pavlo-keynote-sigmod2017.pdf#page=17

### SQL Transaction Model and Protocol

Statements can read the writes of previous statements in the same
transaction. Transactions cannot read the writes of other transactions'
completed statements until those transactions commit. This means there
are two levels of "committing": a statement commits, becoming visible to
later statements within its transaction, and a transaction commits,
becoming visible to later transactions within the database.

Thus, our SQL transaction model is either:
 * The beginning of each statement is a savepoint. At any time during
   execution, the transaction can roll back to the savepoint, undoing
   the effects of a statement.
 * Each statement is a nested transaction within the parent SQL transaction.

And note that SQL doesn't specify what happens to a transaction when
a statement fails. MySQL rolls back the statement, but the transaction
continues. Postgres aborts the entire transaction if a statement fails.
(The same goes for triggers.)
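
A hedged illustration of the divergence, assuming a table `t` with
a unique/primary key on its single column:

```sql
BEGIN;
INSERT INTO t VALUES (1);   -- ok
INSERT INTO t VALUES (1);   -- duplicate key error
INSERT INTO t VALUES (2);   -- MySQL: succeeds, the transaction carries on
                            -- Postgres: "current transaction is aborted,
                            -- commands ignored until end of transaction block"
COMMIT;                     -- MySQL: commits rows 1 and 2
                            -- Postgres: the COMMIT is effectively a ROLLBACK
```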

Claude seems to have forgotten that multi-statement transactions exist.
What's implemented in the percolator model is a one-shot transaction
model. Supporting SQL requires notably extending what's implemented to
support a long-lived transaction status record that can be pointed to
across statements and that maintains the status of individual statements
as well. (Though optimizing for single-statement autocommit is also wise.)
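
As a rough sketch of what that status record could look like (CQL, all
names hypothetical), the transaction-wide state can live in static
columns with one clustering row per statement:

```sql
CREATE TABLE txn_status (
  txn_id     uuid,
  txn_state  text   STATIC,  -- PENDING / COMMITTED / ABORTED for the txn
  commit_ts  bigint STATIC,  -- filled in exactly once, at transaction commit
  stmt_id    int,            -- statement sequence number within the txn
  stmt_state text,           -- e.g. COMPLETED / ROLLED_BACK, so later
                             -- statements can see earlier statements' writes
                             -- and readers can skip rolled-back statements
  PRIMARY KEY (txn_id, stmt_id)
);
```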

Claude also just did something way more complicated than what's actually
needed. There are four tables involved in its percolator implementation:
kv_store, kv_writes, tx_locks, tx_writes. You only need one. The whole
percolator trick is using the write of a value _as_ the lock request.
You need to write the value anyway, and the lock request needs to be
durable so that it can't be lost. If you need to write a value anyway,
it might as well be the SQL value you intend to write, plus a bit of
extra metadata, rather than a separate lock write followed by a SQL
value write as Claude did[4]. Reads filter out the uncommitted write
data anyway. One round of writes can be dropped from the write protocol.
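
A minimal sketch of the one-table layout (CQL, all names hypothetical):
the provisional write at `start_ts`, with `commit_ts` unset and a pointer
to the transaction's primary/status record, _is_ the lock, and committing
is just filling in `commit_ts` (primary record first):

```sql
CREATE TABLE kv (
  key       blob,
  start_ts  bigint,   -- transaction start timestamp, i.e. the version
  value     blob,     -- the encoded SQL row, written up front with the lock
  commit_ts bigint,   -- unset while provisional, i.e. while the lock is held
  primary_k blob,     -- points at the transaction's primary / status record
  PRIMARY KEY (key, start_ts)
) WITH CLUSTERING ORDER BY (start_ts DESC);
```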

[4]: https://github.com/geico/cassandra-sql/blob/a0257ec9a22ff84daaf6f529ae8b523fdc45b431/src/main/java/com/geico/poc/cassandrasql/kv/KvTransactionCoordinator.java#L381

### SQL Concurrency Control

OCC SQL has a long history of turning out poorly. Cockroach started with
OCC, and moved to locking. TiDB started with OCC, and moved to locking.
Claude thinks Cassandra should start with OCC, and I think it should
move to locking. The difference here is somewhat simple: the `IF
existing_lock IS NULL AND conflict IS NULL`[5] conditional can be
removed, and lock requests are always inserted. The winner is the oldest
lock request. I can't seem to find a mechanism in Cassandra for being
notified when a key is updated, but conceptually, a losing transaction
then identifies the next older lock request and waits for its "primary"
record / transaction status record to have a commit outcome for that
transaction before the statement restarts. Jordan Isaacs wrote a nice
post[6] explaining the old RDBMS trick that continuing writes past the
first conflict permits bounding the number of retry attempts.
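
Sketched against the hypothetical one-table layout above (conceptual
CQL; the filtering in step 2 would realistically happen client-side):

```sql
-- 1. Unconditionally enqueue the lock request by writing the provisional
--    value; no `IF existing_lock IS NULL AND conflict IS NULL` guard.
INSERT INTO kv (key, start_ts, value, primary_k)
VALUES (:key, :my_start_ts, :value, :my_primary);

-- 2. Read back the versions of the key and look for provisional writes
--    (commit_ts unset) older than ours; the oldest lock request wins.
SELECT start_ts, commit_ts, primary_k FROM kv WHERE key = :key;

-- 3. If an older provisional write exists, poll its transaction's primary
--    record until that transaction is COMMITTED or ABORTED, then restart
--    the statement.
```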

I'm also almost offended that Claude didn't even try the one massive
trick made possible by having a whole transactional database underlying
percolator: just fold the entire commit protocol into one Accord
transaction. Cockroach introduced parallel commits[7] as the
small-transaction optimization for a percolator-y commit protocol, and
TiDB copied the idea and referred to it as 1PC[8] (which, and
I absolutely cannot stress enough, _does not have one phase_). With
actual transactions already in the system, one can take the same idea
and fold it into a single transaction that validates that no locks are
held on the read set, and installs the write set directly in committed
form with commit timestamps fully filled in.
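
Roughly, assuming CEP-15's `BEGIN TRANSACTION` syntax and the
hypothetical `kv` layout from earlier (this is a sketch; the exact
conditional syntax is hand-waved):

```sql
BEGIN TRANSACTION
  -- Validate the read set: the version we read is still committed and
  -- not shadowed by someone else's lock.
  LET r = (SELECT commit_ts FROM kv
           WHERE key = :read_key AND start_ts = :read_version);
  IF r.commit_ts IS NOT NULL THEN
    -- Install the write set directly in committed form: commit_ts is
    -- filled in immediately, so there's no separate prewrite round and
    -- no post-commit rewrite pass.
    INSERT INTO kv (key, start_ts, value, commit_ts)
    VALUES (:write_key, :ts, :value, :ts);
  END IF
COMMIT TRANSACTION;
```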

The MVCC version filtering is one of the most painful parts to
read/watch in a SQL-on-KV system. Having to pull every version of
a frequently modified key across the wire just to throw all but one of
them away hurts. Getting CQL to return all the uncommitted rows and only
the maximum committed row is likely possible and would help notably. But
all the normal tricks one plays to keep read amplification from getting
bad (undo logs, separate spaces for old versions, time-separated btrees,
whatever) can't be applied when layering on top of another system. It
has to be vertically integrated.

MVCC cleanup also becomes an expensive process when layered on top of
another database. Claude took the only approach possible here: having
a background worker periodically scan the full database and perform the
needed cleanup as DELETEs. Contrast this with Yugabyte's cleanup
protocol[9], which records the specific tablets that need to participate
in cleanup for each transaction (as a succinct summary of the keys
involved), and folds the MVCC cleanup into LSM compaction to avoid both
the penalty of full data scans and the extra writes and deletes issued
as cleanup work. Accord letting one apply cleanup logic to all versions
of a key atomically is very nice, though. MVCC cleanup racing with reads
is a huge source of bugs, and it gets significantly more complicated
when reads can see a half-cleaned-up state of a logical key.

I would also expect that the Accord timestamp that's used for ordering
statements can be used as the timestamp for SQL transactions? If there's
no support for writing that value into a record as part of an Accord
commit, it's a great feature to have.

[5]: https://github.com/geico/cassandra-sql/blob/a0257ec9a22ff84daaf6f529ae8b523fdc45b431/src/main/java/com/geico/poc/cassandrasql/kv/KvTransactionCoordinator.java#L321
[6]: https://snowytrees.dev/blog/bounding-write-write-conflict-retries/
[7]: https://www.cockroachlabs.com/blog/parallel-commits/
[8]: https://pingcap.github.io/tidb-dev-guide/understand-tidb/1pc.html
[9]: https://docs.yugabyte.com/stable/architecture/transactions/transactional-io-path/#asynchronously-apply-and-clean-up-provisional-records

### SQL Features

I think SQL has done a really amazing job at designing a set of features
such that implementing the core of SQL in a completely sane and
reasonable way locks you into design and code decisions which break
under more esoteric SQL features.

To give a few random examples:

* Foreign keys are a fun example of where SQL can't fit into the
  restrictions of one Accord transaction. Accord can perform a set of
  reads and then a conditional set of writes. Foreign keys require
  a round of reads to generate what to write, a round of writes to write
  those values, and another round of reads to validate the constraints
  hold for the written values. (A table can FK to itself in SQL, and
  thus a single statement can both generate a constraint which needs to
  hold and newly fulfill that constraint; see the sketch after this
  list.)
* Deferred constraints introduce a fun set of complications, my favorite
  being that primary key uniqueness is a constraint, and can be
  deferred. In the implementation, the primary key may not be unique
  within a transaction (it just must be unique for committed data).
* Secondary index backfill under MVCC leaves you wanting to be able to
  write values into the past so that the index data is at the same
  version as the data it indexes.
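
To make the self-referential FK case from the first bullet concrete
(postgres-flavored, table made up for illustration):

```sql
CREATE TABLE employees (
  id         int PRIMARY KEY,
  manager_id int REFERENCES employees (id)
);

-- One statement both creates the constraint obligation and satisfies it:
-- row 1 is its own manager, and row 2's manager is the row written just
-- before it by the same statement. The FK check can only run after the
-- writes land, which is the extra round of reads a single
-- read-then-conditional-write transaction can't express.
INSERT INTO employees (id, manager_id) VALUES (1, 1), (2, 1);
```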

SQL features, and the features needed to support SQL optimizations,
felt like a never-ending story of unexpected design requirements you
only discover too late. Leaving yourself escape hatches of additional
bytes in per-key, per-transaction, and per-table metadata is very wise
(or just the ability to add more metadata if needed). If there's any
idea of "we're supporting this subset of SQL features and definitely not
that subset of SQL features", that's a great thing to establish up front
so you know what to plan for.

(And don't forget that NULL = NULL doesn't evaluate to TRUE, so UNIQUE
indexes happily store every row that's NULL for the indexed value![10])
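
For example, in stock postgres (with its default "nulls distinct"
behavior for unique constraints):

```sql
CREATE TABLE users (email text UNIQUE);
INSERT INTO users VALUES (NULL);
INSERT INTO users VALUES (NULL);  -- also succeeds: the two NULLs don't
                                  -- compare equal, so no uniqueness violation
SELECT count(*) FROM users;       -- 2
```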

[10]: https://raymondtukpe.com/sql-nulls-are-weird.html

### Postgres-isms

If it looks like postgres, users will try to use it like it is
postgres, and there is an ever-deepening chasm between "postgres-like"
and "behaves exactly like postgres". Many postgres tools rely
specifically on the postgres system tables for schema introspection.
You can observe CedarDB's suffering on this topic[11]. Even Postgres's
interpretation of some of SQL itself is troubling, and Jamie Brandon has
kept "SELECT wat FROM sql"[12] going, cataloging some of the horrors he
hit while working on Materialize. Also expect that postgres datatypes
won't line up with Cassandra datatypes in some silly way: probably
timestamp/timezone handling or unicode or float conversions or some
other standard corner of silly compatibility bugs.
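
For a flavor of what that introspection looks like, here's the sort of
catalog query psql, drivers, and ORMs issue (the table name is made up);
a postgres-like front end has to answer these too, not just DML:

```sql
SELECT c.relname, a.attname, t.typname
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_attribute a ON a.attrelid = c.oid AND a.attnum > 0
JOIN pg_catalog.pg_type t ON t.oid = a.atttypid
WHERE c.relname = 'orders'
  AND NOT a.attisdropped;
```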

[11]: https://cedardb.com/docs/compatibility/system_table/
[12]: https://www.scattered-thoughts.net/writing/select-wat-from-sql/

### The Rest

Note that most of what I've discussed above is just the bottom half of
a database, and even that isn't complete. Cockroach has a really cute
trick of safely letting later statements in a transaction begin
executing before replication finishes for earlier ones. Distributed SQL
execution is a deep topic, often deeply intertwined with the storage
format for best performance. Even just JOINs are hard, because they mean
handling an exploding number of intermediate results and multiple stages
of execution, but to Jeff's point about the choice of language, there's
an argument that some Apple folk can wander over to their local Spark
team and go "so... how hard was shipping datafusion in the JVM for
Comet?" and not _super_ have to worry about that.

The above is mostly just the chunks I've worked on or around before,
but all of it left the impression that competing on OLTP performance
seems intractable when your competition gets to break abstraction layers
and you don't. Cassandra does already seem to let one register custom
code plugin-style for other features (e.g. Triggers), and how much of
that can be used or extended to permit the abstraction breaking
necessary for performance is what I'd be staring at closely if someone
told me to go implement SQL for Cassandra. ;)
