To lead with the summary, there are two main points below. First, unsurprisingly, Claude generated incomplete and buggy code for implementing SQL. Great to see we all haven't been completely replaced by LLMs yet. Pointing out some of what's missing or done wrong is hopefully useful for discussing the gap between what an existing postgres OLTP application needs and what the example code shows.
Second, layering SQL on top of a key-value database, even a transactional one, will be slow. Layering on CQL is a bit nicer in that it makes certain forms of pushdown optimization easy, but it doesn't fundamentally change that firm boundaries between SQL and the rest of the system lead to worse performance. Implementing SQL with the goal of being competitive with existing NewSQL solutions on performance leads to a different set of design requirements than just offering SQL to users as a convenience, as a more powerful CQL. Percolator runs fine on just a linearizable key-value store, and Accord allows some nice simplifications and optimizations, but it doesn't change what's feasible for implementing SQL, just its performance.

I had also, coincidentally, been incrementally chipping away at a blog post on the subject of SQL on key-value stores, and its designs, tradeoffs, constraints, and secretly the overarching point that a strict layering is terrible for performance. Some of the structure and content here is borrowed from that draft, but in an effort to not _entirely_ be my verbose self, you're getting the rough outline of some of those thoughts on designs intermixed with points on Claude's specific curious decisions. Happy to hammer out the fuller version of any sub-topic if there's interest or questions.

### SQL Isolation Level

There are three popular choices of isolation level: Serializable (CockroachDB), Repeatable Read aka Snapshot Isolation (TiDB & Yugabyte), and Read Committed, which is actually two different isolation levels: ANSI Read Committed, which the standard specifies but nearly no one implements, and Read Committed Snapshot Isolation[1], which nearly every database implements as Read Committed. What Claude implemented describes itself as Serializable, is implemented in the style of Snapshot Isolation, but actually only offers some half-broken form of ANSI Read Committed.
Specifically, `commitTs == null` indicates _either_ a key that is uncommitted, _or_ a key that has been committed but hasn't been asynchronously updated post-commit yet. As far as I can tell, KvStore.get()[2] rejects both cases, and thus reads should only see values after the async post-commit process rewrites them. That looks like best-effort transaction atomicity. (If anyone is wondering why TPC-C would run fine: TPC-C doesn't do any correctness validation.)

Read Committed Snapshot Isolation is a very popular choice in OLTP applications[3]. It's the strongest isolation level that still permits the server to retry statements upon conflicts without aborting the entire transaction, and it means clients don't really need transaction retry loops. Even Cockroach, who took the approach of Serializability By Default, was forced to later return and offer Read Committed to help users port existing postgres applications. I specifically suggest against trying to offer serializability: Cockroach can only do it because they sacrifice abort-free read-only transactions. The standard percolator / client-driven three-phase-commit algorithm can't do better than snapshot isolation.

[1]: https://sqlperformance.com/2014/05/t-sql-queries/read-committed-snapshot-isolation
[2]: https://github.com/geico/cassandra-sql/blob/a0257ec9a22ff84daaf6f529ae8b523fdc45b431/src/main/java/com/geico/poc/cassandrasql/kv/KvStore.java#L199
[3]: https://www.cs.cmu.edu/~pavlo/slides/pavlo-keynote-sigmod2017.pdf#page=17

### SQL Transaction Model and Protocol

Statements can read the writes of previous statements in the same transaction. Transactions cannot read the writes of other transactions' completed statements until those transactions commit. This means there are two levels of "committing": a statement commits, becoming visible to later statements within its transaction, and a transaction commits, becoming visible to later transactions within the database.
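As a rough illustration of those two levels, per-version metadata might carry both a transaction id and a statement sequence number. This is only a sketch; every name here (Write, stmtSeq, etc.) is invented for illustration, not taken from the repo:

```java
// Hypothetical sketch: tracking the two levels of "committed" with
// per-write metadata. None of these names come from cassandra-sql.
final class Write {
    final long txnId;    // which transaction wrote this version
    final int stmtSeq;   // which statement within that transaction
    final long commitTs; // transaction commit timestamp; -1 while uncommitted
    final String value;
    Write(long txnId, int stmtSeq, long commitTs, String value) {
        this.txnId = txnId; this.stmtSeq = stmtSeq;
        this.commitTs = commitTs; this.value = value;
    }
}

final class Visibility {
    // A statement (readerTxn, readerSeq) sees:
    //  - its own transaction's writes from *earlier* statements,
    //    whether or not the transaction has committed, and
    //  - other transactions' writes only once the whole transaction
    //    committed at or before the reader's snapshot timestamp.
    static boolean visible(Write w, long readerTxn, int readerSeq, long snapshotTs) {
        if (w.txnId == readerTxn) return w.stmtSeq < readerSeq;
        return w.commitTs != -1 && w.commitTs <= snapshotTs;
    }
}
```

Statement rollback then becomes discarding every write of the aborted statement's (txnId, stmtSeq), without touching the rest of the transaction.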
Thus, our SQL transaction model is either:

* The beginning of each statement is a savepoint. At any time during execution, the transaction can roll back to the savepoint, undoing the effects of the statement.
* Each statement is a nested transaction within the parent SQL transaction.

And note that SQL doesn't specify what happens to a transaction when a statement fails. MySQL rolls back the statement, but the transaction continues. Postgres aborts the entire transaction if a statement fails. (The same goes for triggers.)

Claude seems to have forgotten that multi-statement transactions exist. What's implemented in the percolator model is a one-shot transaction model. Supporting SQL requires notably extending what's implemented to support a long-lived transaction status record that can be pointed to across statements, and that maintains the status of individual statements as well. (Though optimizing for single-statement autocommit is also wise.)

Claude also just did something way more complicated than what's actually needed. There are four tables involved in its percolator implementation: kv_store, kv_writes, tx_locks, tx_writes. You only need one. The whole percolator trick is using the write of a value _as_ the lock request. You need to write the value anyway. The lock request needs to be durable so that it can't be lost. If you need to write a value anyway, it might as well be the SQL value you're intending to write plus a bit of extra metadata, rather than a separate lock write followed by a SQL value write as Claude did[4]. Reads filter out the uncommitted write data anyway. One round of writes can be dropped from the write protocol.

[4]: https://github.com/geico/cassandra-sql/blob/a0257ec9a22ff84daaf6f529ae8b523fdc45b431/src/main/java/com/geico/poc/cassandrasql/kv/KvTransactionCoordinator.java#L381

### SQL Concurrency Control

OCC SQL has a long history of turning out poorly. Cockroach started with OCC, and moved to locking.
TiDB started with OCC, and moved to locking. Claude thinks Cassandra should start with OCC, and I think it should move to locking. The difference here is somewhat simple: the `IF existing_lock IS NULL AND conflict IS NULL`[5] conditional can be removed, and lock requests are always inserted. The winner is the oldest lock request. I can't seem to find a mechanism in Cassandra to be notified when a key is updated, but conceptually, one then identifies the next younger lock request and waits for its "primary" record / transaction status record to have a commit outcome for the transaction before the statement restarts. Jordan Isaacs wrote a nice post[6] explaining the old RDBMS trick that continuing writes past the first conflict permits bounding the number of retry attempts.

I'm also almost offended that Claude didn't even try the one massive trick made possible by having a whole transactional database underlying percolator: just fold the entire commit protocol into one Accord transaction. Cockroach introduced parallel commits[7] as the small-transaction optimization for a percolator-y commit protocol, and TiDB copied the idea and referred to it as 1PC[8] (which, and I absolutely cannot stress this enough, _does not have one phase_). With actual transactions already in the system, one can take the same idea, but fold it into a transaction that validates that no locks are held as part of the read set, while the write set installs values directly in committed form with commit timestamps fully filled in.

The MVCC version filtering is one of the most painful parts to read/watch in a SQL-on-KV system. Having to pull all the versions of a frequently modified key across the wire just to throw all but one of them away hurts. Getting CQL to return you all uncommitted rows and only the maximum committed row is likely possible and helps notably.
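A sketch of that read-side filtering, with hypothetical types (nothing here is from the repo): every version of the key crosses the wire, and the coordinator keeps exactly one.

```java
import java.util.*;

// Sketch of read-side MVCC filtering in a layered SQL-on-KV design.
// The pain point: all versions are fetched just to run this loop.
final class Version {
    final long commitTs; // -1 for uncommitted/provisional writes
    final String value;
    Version(long commitTs, String value) { this.commitTs = commitTs; this.value = value; }
}

final class MvccRead {
    // Return the newest committed version at or below readTs, or null.
    // Uncommitted versions are skipped here; a real implementation must
    // first resolve them against the lock's transaction status record.
    static Version latestVisible(List<Version> versions, long readTs) {
        Version best = null;
        for (Version v : versions) {
            if (v.commitTs == -1 || v.commitTs > readTs) continue;
            if (best == null || v.commitTs > best.commitTs) best = v;
        }
        return best;
    }
}
```

Pushing the "maximum committed row" selection down into CQL would mean this loop runs server-side and only one version per key is returned.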
But all the normal tricks that one plays here to make read amplification less bad (undo logs, separate spaces for old versions, time-separated btrees, whatever) can't be applied when layering on top of another system. It has to be vertically integrated.

MVCC cleanup also becomes an expensive process when layered on top of another database. Claude took the only possible approach: a background worker periodically scans the full database and performs needed cleanup as DELETEs. Contrast this with Yugabyte's cleanup protocol[9], which records the specific tablets that need to participate in cleanup for each transaction (as a succinct summary of the keys involved), and folds the MVCC cleanup into LSM compaction to avoid the additional penalty of full data scans and of adding more writes and deletes as cleanup work. Accord letting one apply cleanup logic to all versions of a key atomically is very nice, though. MVCC cleanup racing with reads is a huge source of bugs, and it's made significantly more complicated when reads can see a half-cleaned-up state of a logical key.

I would also expect that the Accord timestamp that's used for ordering statements can be used as the timestamp for SQL transactions? If there's no support for writing that value into a record as part of an Accord commit, it's a great feature to have.
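The shape of that targeted-cleanup bookkeeping might look like the sketch below. This is only the idea, not Yugabyte's actual protocol, and all names are invented: record which partitions received provisional writes for each transaction, so cleanup visits only those rather than scanning everything.

```java
import java.util.*;

// Hedged sketch of per-transaction cleanup bookkeeping: track which
// partitions hold a transaction's provisional writes, so post-commit
// cleanup is a targeted pass instead of a full-database scan.
final class CleanupRegistry {
    private final Map<Long, Set<String>> touched = new HashMap<>();

    // Called on each provisional write during transaction execution.
    void recordWrite(long txnId, String partition) {
        touched.computeIfAbsent(txnId, t -> new HashSet<>()).add(partition);
    }

    // Once the commit/abort decision is durable, only these partitions
    // need their provisional records rewritten or deleted.
    Set<String> partitionsToClean(long txnId) {
        return touched.getOrDefault(txnId, Set.of());
    }

    void done(long txnId) { touched.remove(txnId); }
}
```

The registry itself has to be durable alongside the transaction status record, or a crashed coordinator falls back to scanning anyway.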
[5]: https://github.com/geico/cassandra-sql/blob/a0257ec9a22ff84daaf6f529ae8b523fdc45b431/src/main/java/com/geico/poc/cassandrasql/kv/KvTransactionCoordinator.java#L321
[6]: https://snowytrees.dev/blog/bounding-write-write-conflict-retries/
[7]: https://www.cockroachlabs.com/blog/parallel-commits/
[8]: https://pingcap.github.io/tidb-dev-guide/understand-tidb/1pc.html
[9]: https://docs.yugabyte.com/stable/architecture/transactions/transactional-io-path/#asynchronously-apply-and-clean-up-provisional-records

### SQL Features

I think SQL has done a really amazing job of designing a set of features such that implementing the core of SQL in a completely sane and reasonable way locks you into design and code decisions which break under more esoteric SQL features. To give a few random examples:

* Foreign keys are a fun example of where SQL can't fit into the restrictions of one Accord transaction. Accord can perform a set of reads and then a conditional set of writes. Foreign keys require a round of reads to generate what to write, a round of writes to write those values, and another round of reads to validate that the constraints hold for the written values. (A table can FK to itself in SQL, and thus one statement can both generate a constraint which needs to hold and newly fulfill that constraint.)
* Deferred constraints introduce fun sets of complications, my favorite being that primary key uniqueness is a constraint, and can be deferred. In the implementation, the primary key may not be unique within a transaction (it just must be unique for committed data).
* Secondary index backfill under MVCC leaves you wanting to be able to write values into the past so that the index data is at the same version as the data it indexes.

SQL features, and features to support SQL optimizations, felt like a never-ending story of unexpected design requirements you only discover too late.
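Several of those examples reduce to uniqueness checks, and SQL's three-valued logic makes even those weird: a comparison involving NULL yields UNKNOWN, and a UNIQUE check only rejects a row when equality is definitely TRUE. A minimal sketch, with helper names invented for illustration:

```java
import java.util.Optional;

// Sketch of SQL's three-valued comparison logic. Optional.empty()
// models UNKNOWN: any comparison with NULL is neither true nor false.
final class ThreeValued {
    static Optional<Boolean> eq(Integer a, Integer b) {
        if (a == null || b == null) return Optional.empty(); // UNKNOWN
        return Optional.of(a.equals(b));
    }

    // A UNIQUE constraint rejects only when equality is TRUE, so two
    // NULLs (eq == UNKNOWN) never conflict with each other.
    static boolean uniqueViolation(Integer existing, Integer incoming) {
        return eq(existing, incoming).orElse(false);
    }
}
```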
Leaving yourself escape hatches of additional bytes in per-key, per-transaction, and per-table metadata is very wise (or just the ability to add more metadata if needed). If there's any idea of "we're supporting this subset of SQL features and definitely not that subset of SQL features", that's a great thing to establish up front so you know what to plan for. (And don't forget that NULL = NULL doesn't evaluate to TRUE, so UNIQUE indexes store every row that's NULL for the indexed value![10])

[10]: https://raymondtukpe.com/sql-nulls-are-weird.html

### Postgres-isms

If it looks like postgres, users will try to use it like it is postgres, and there is an ever-deep chasm between "postgres-like" and "behaves exactly like postgres". Many postgres tools rely specifically on the postgres system tables allowing schema introspection. You can observe CedarDB's suffering on this topic[11]. Even Postgres's interpretation of some SQL itself is troubling, and Jamie Brandon has kept "SELECT wat FROM sql"[12] going, cataloging some of the horrors he hit working on Materialize. Also expect that postgres datatypes won't line up with Cassandra datatypes in some silly way, probably timestamp/timezone handling or unicode or float conversions or some other standard corner of silly compatibility bugs.

[11]: https://cedardb.com/docs/compatibility/system_table/
[12]: https://www.scattered-thoughts.net/writing/select-wat-from-sql/

### The Rest

Note that most of what I've discussed above is just the bottom half of databases, and even that isn't complete. Cockroach has a really cute trick of letting later statements in a transaction begin executing before replication finishes for earlier ones, safely. Distributed SQL execution is a deep topic, often deeply intertwined with storage format for best performance.
Even just JOINs are hard because they represent handling an exploding number of intermediate results and multiple stages of execution, but to Jeff's point about the choice of language, there's an argument that some Apple folk can wander over to their local Spark team and go "so... how hard was shipping datafusion in the JVM for Comet?" and not _super_ have to worry about that. The above is mostly just the chunks I've worked on/around before, but all of that left the impression that competing on OLTP performance seems intractable when your competition gets to break abstraction layers and you don't. Cassandra does seem to let one register custom code plugin-style for other features already (e.g. Triggers), and how much of that can be used/extended to permit the abstraction-breaking necessary for performance is what I'd be staring at closely if someone told me to go implement SQL for Cassandra. ;)
