I feel like I should volunteer to write about MongoDB transactions. TL;DR: Snapshot Isolation and Causal Consistency, built from a Raft-like replication protocol, a Lamport clock, and 2PC. This leads to the age-old discussion of whether users really want serializability or not.
On Wed, Sep 22, 2021 at 1:44 AM Jonathan Ellis <jbel...@gmail.com> wrote:

> The whitepaper here is a good description of the consensus algorithm itself
> as well as its robustness and stability characteristics, and its comparison
> with other state-of-the-art consensus algorithms is very useful. In the
> context of Cassandra, where a consensus algorithm is only part of what will
> be implemented, I'd like to see a more complete evaluation of the
> transactional side of things as well, including performance characteristics
> as well as the types of transactions that can be supported and at least a
> general idea of what it would look like applied to Cassandra. This will
> allow the PMC to make a more informed decision about what tradeoffs are
> best for the entire long-term project of first supplementing and ultimately
> replacing LWT.
>
> (Allowing users to mix LWT and AP Cassandra operations against the same
> rows was probably a mistake, so in contrast with LWT we’re not looking for
> something fast enough for occasional use but rather something within a
> reasonable factor of AP operations, appropriate to being the only way to
> interact with tables declared as such.)
> Besides Accord, this should cover
>
> - Calvin and FaunaDB
> - A Spanner derivative (no opinion on whether that should be Cockroach or
> Yugabyte, I don’t think it’s necessary to cover both)
> - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
> there is more public information about MongoDB)
> - RAMP

=MongoDB=

References:

Presentation: https://www.youtube.com/watch?v=quFheFrLLGQ
Slides: http://henrikingo.github.io/presentations/HighLoad%202019%20-%20Distributed%20transactions%20top%20to%20bottom/index.html#/step-1
Lamport implementation: http://delivery.acm.org/10.1145/3320000/3314049/p636-tyulenev.pdf
Replication: http://www.vldb.org/pvldb/vol12/p2071-schultz.pdf and https://www.usenix.org/system/files/nsdi21-zhou.pdf
TPC-C benchmark: http://www.vldb.org/pvldb/vol12/p2254-kamsky.pdf
(Nothing published on cross-shard trx...)

Approach and Guarantees:

Shards are independent replica sets; multi-shard transactions and queries are handled through a coordinator, aka query router. Replica sets are Raft-like, so leader-based. When using 2PC, the 2PC coordinator is itself a replica set, so that the coordinator state is made durable via majority commits. This means that a cross-shard transaction actually needs 4 majority commits, but it would be possible to reduce the latency to client ack to 2 commits (https://jira.mongodb.org/browse/SERVER-47130). Because of this, the trx coordinator is also its own recovery manager, and it is assumed that the replica set will always be able to recover from failures, usually quickly.

Cluster time is a Lamport clock; in practice the implementation uses a unix timestamp + counter to generate monotonically increasing integers. The cluster time is passed along with each message, and each recipient updates its own cluster time to the higher of the two timestamps. All nodes, including clients, participate in this. Causal Consistency is basically a client asking to read at or later than its current timestamp.
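The cluster time scheme above can be sketched in a few lines. This is a toy model of a unix-timestamp-plus-counter Lamport clock, not MongoDB's actual implementation; all names here are mine:

```python
import time

class ClusterTime:
    """Toy Lamport clock encoded as (unix_seconds, counter)."""

    def __init__(self, now=None):
        # The wall clock is injectable so the sketch is deterministic.
        self._now = now or (lambda: int(time.time()))
        self.secs = self._now()
        self.counter = 0

    def value(self):
        return (self.secs, self.counter)

    def tick(self):
        """Advance cluster time for a local event (e.g. a write)."""
        wall = self._now()
        if wall > self.secs:
            self.secs, self.counter = wall, 0
        else:
            self.counter += 1  # wall clock hasn't moved: bump the counter
        return self.value()

    def observe(self, received):
        """Merge the cluster time carried on an incoming message:
        keep the higher of our time and the received time."""
        if received > self.value():
            self.secs, self.counter = received
        return self.value()

# Causal read: a client remembers the highest cluster time it has seen
# and asks to read at-or-after it; a lagging replica would wait until
# its applied time catches up (simulated here as a plain comparison).
def can_serve_causal_read(replica_time, after_cluster_time):
    return replica_time >= after_cluster_time
```

Because timestamps are totally ordered tuples, taking the max on every message is all that's needed for the clock to be monotonic across the whole cluster, clients included.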
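Similarly, a toy sketch of the cross-shard commit path as I understand it. The shard class and the rule of committing everywhere at the maximum prepare timestamp are my assumptions for illustration, not MongoDB internals:

```python
class FakeShard:
    """In-memory stand-in for a shard (a replica set); illustrative only."""

    def __init__(self, cluster_time):
        self.cluster_time = cluster_time  # (unix_secs, counter)
        self.committed_at = None

    def prepare(self, txn):
        # Real MongoDB would persist the prepared state via a majority
        # commit, then answer with a prepare timestamp from its cluster time.
        return self.cluster_time

    def commit(self, txn, commit_ts):
        self.committed_at = commit_ts

def two_phase_commit(shards, txn):
    # Phase 1: prepare on every participant.
    prepare_timestamps = [s.prepare(txn) for s in shards]
    # The coordinator's decision is itself made durable via a majority
    # commit of its own replica set (elided in this sketch).
    # Assumed rule: global commit time = max of the prepare times, so all
    # shards expose the transaction at the same cluster time.
    commit_ts = max(prepare_timestamps)
    # Phase 2: commit everywhere at that cluster time.
    for s in shards:
        s.commit(txn, commit_ts)
    return commit_ts
```

Counting the majority commits in this flow (one per shard prepare, one for the coordinator's decision, one per shard commit) is what gives the "4 majority commits" figure above for a two-shard transaction's critical path.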
A replica will block if needed to satisfy such a causal read. The Lamport clock is incremented by leaders to ensure progress in the absence of write transactions.

The storage engine provides MVCC semantics. Extending this to the replication system is straightforward, since replicas apply transactions serially in the same order. For cross-shard transactions it's the job of the transaction coordinator to commit the transaction with the same cluster time on all shards. If I remember correctly, in the 2PC commit phase it will simply choose the highest of the timestamps returned by the shards as the global transaction timestamp. Combined, MongoDB transactions are snapshot isolation + causal consistency.

Performance:

2PC is used only if a transaction actually has multiple participating shards. It is possible, though not fun or realistic, to specify partition boundaries so that related records from two collections will always reside on the same shard. The 2PC protocol actually requires 4 majority commits, although as of MongoDB 5.0 the client only waits for 3. Majority commit is exactly what QUORUM is in Cassandra, so in a multi-DC cluster, commit waits for cross-DC replication latency.

Notably, single-shard transactions parallelize well, because conflicting transactions can execute on the leader even while the majority commit isn't yet finished. (This involves some speculative execution optimization.) I don't believe the same is true for cross-shard transactions using 2PC.

The paper by Asya Kamsky uses a single replica set and reports 60-70k TPM for a non-standard TPC-C, where varying the number of client threads was allowed and the schema was modified to take advantage of denormalization in a document model. I'm not aware of benchmarks for cross-shard transactions, nor would I expect such results to be great. The philosophy there has been that cross-shard transactions are expected to be a minority of the workload.

Functionality and limitations:

MongoDB's approach has been similar in spirit to what we can observe in the RDBMS market.
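Since snapshot isolation keeps coming up here, it's worth recalling the one anomaly that separates it from serializability: write skew. A toy, self-contained illustration (nothing MongoDB-specific):

```python
# Write skew: two transactions each read a stale snapshot, each writes a
# DIFFERENT key, so snapshot isolation's first-committer-wins check on
# write-write conflicts never fires -- yet the combined result violates
# an invariant ("at least one doctor is on call") that each transaction
# checked against its own snapshot.

def write_skew_demo():
    db = {"alice_on_call": True, "bob_on_call": True}

    # Both transactions take their snapshot before either commits.
    snap1 = dict(db)
    snap2 = dict(db)

    # Txn 1: Alice may go off call if someone else is still on call.
    if snap1["bob_on_call"]:
        db["alice_on_call"] = False   # commits fine: disjoint write set

    # Txn 2: same check, against its own (equally stale) snapshot.
    if snap2["alice_on_call"]:
        db["bob_on_call"] = False     # also commits: no write conflict

    return db

# Under serializable isolation one of the two transactions would abort,
# or be ordered after the other and see its update; under SI both commit
# and nobody is left on call.
```

Whether that anomaly matters in practice is exactly the question the next paragraph is about.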
While MySQL (since forever) and PostgreSQL (since 2011) provide serializable isolation, it is not the default, and it's hard to find a single user who ever wanted to use it. Snapshot Isolation and Causal Consistency are considered the optimal tradeoff between good consistency on one hand, and performance with minimal hassle from aborted transactions on the other. The typical MongoDB user is, like the typical MySQL and PostgreSQL user, happy with this. It is possible to emulate SELECT FOR UPDATE by using findAndModify, which turns your reads into writes and therefore takes a write lock on all touched records.

Note that the first versions of MongoDB transactions got a quite bad Jepsen review. This was mostly a function of none of the above guarantees being the default, and the client API being really confusing, so most users - including Kyle Kingsbury and yours truly - would struggle to get all parameters right to actually enjoy the above-mentioned guarantees. This is a sober reminder that this is complex stuff to get right end to end.

Note that MongoDB also supports linearizable writes and reads, but only on a per-record basis. Linearizable is not available for transactions. It should be noted that MongoDB's approach allows for interactive transactions.

Application to Cassandra:

Replication being leader-based is a poor fit for the expectations of a typical Cassandra user. It's hard to predict whether a typical Cassandra workload can expect cross-partition transactions to be the exceptional case, but my instinct says no. The Lamport clock and the causal consistency it provides are simple to understand and could be a building block in a transactional Cassandra cluster. My personal opinion is that a "synchronized timestamp" (or Hybrid Logical Clock, I guess?) scheme like in Accord is more familiar to current Cassandra semantics.

henrik

--
Henrik Ingo
+358 40 569 7354