I feel like I should volunteer to write about MongoDB transactions. TL;DR: Snapshot Isolation and Causal Consistency, built from a Raft-like replication protocol, a Lamport clock, and 2PC. This leads to the age-old discussion of whether users really want serializability or not.
On Wed, Sep 22, 2021 at 1:44 AM Jonathan Ellis <jbel...@gmail.com> wrote:

> The whitepaper here is a good description of the consensus algorithm itself
> as well as its robustness and stability characteristics, and its comparison
> with other state-of-the-art consensus algorithms is very useful. In the
> context of Cassandra, where a consensus algorithm is only part of what will
> be implemented, I'd like to see a more complete evaluation of the
> transactional side of things as well, including performance characteristics
> as well as the types of transactions that can be supported and at least a
> general idea of what it would look like applied to Cassandra. This will
> allow the PMC to make a more informed decision about what tradeoffs are
> best for the entire long-term project of first supplementing and ultimately
> replacing LWT.
>
> (Allowing users to mix LWT and AP Cassandra operations against the same
> rows was probably a mistake, so in contrast with LWT we’re not looking for
> something fast enough for occasional use but rather something within a
> reasonable factor of AP operations, appropriate to being the only way to
> interact with tables declared as such.)
> Besides Accord, this should cover
>
> - Calvin and FaunaDB
> - A Spanner derivative (no opinion on whether that should be Cockroach or
> Yugabyte, I don’t think it’s necessary to cover both)
> - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
> there is more public information about MongoDB)
> - RAMP

=MongoDB=

References:

Presentation: https://www.youtube.com/watch?v=quFheFrLLGQ
Slides: http://henrikingo.github.io/presentations/HighLoad%202019%20-%20Distributed%20transactions%20top%20to%20bottom/index.html#/step-1
Lamport implementation: http://delivery.acm.org/10.1145/3320000/3314049/p636-tyulenev.pdf
Replication: http://www.vldb.org/pvldb/vol12/p2071-schultz.pdf and https://www.usenix.org/system/files/nsdi21-zhou.pdf
TPC-C benchmark: http://www.vldb.org/pvldb/vol12/p2254-kamsky.pdf
(Nothing published on cross-shard trx...)

Approach and Guarantees:

Shards are independent replica sets; multi-shard transactions and queries are handled through a coordinator, aka query router. Replica sets are Raft-like, so leader-based. When using 2PC, the 2PC coordinator is itself a replica set, so that the coordinator state is made durable via majority commits. This means that a cross-shard transaction actually needs 4 majority commits, but it would be possible to reduce the latency to client ack to 2 commits (https://jira.mongodb.org/browse/SERVER-47130). Because of this, the trx coordinator is also its own recovery manager, and it is assumed that the replica set will always be able to recover from failures, usually quickly.

Cluster time is a Lamport clock; in practice the implementation uses a unix timestamp + counter to generate monotonically increasing integers. The cluster time is passed along with each message, and each recipient updates its own cluster time to the higher of the two timestamps. All nodes, including clients, participate in this. Causal Consistency is basically a client asking to read at or later than its current timestamp.
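The cluster time scheme above can be sketched in a few lines. This is a toy model of a unix-timestamp-plus-counter Lamport clock, not MongoDB's actual implementation; all names here are mine:

```python
import time

class ClusterTime:
    """Toy Lamport clock encoded as (unix_seconds, counter)."""

    def __init__(self, now=None):
        # The wall clock is injectable so the sketch is deterministic.
        self._now = now or (lambda: int(time.time()))
        self.secs = self._now()
        self.counter = 0

    def value(self):
        return (self.secs, self.counter)

    def tick(self):
        """Advance cluster time for a local event (e.g. a write)."""
        wall = self._now()
        if wall > self.secs:
            self.secs, self.counter = wall, 0
        else:
            self.counter += 1  # wall clock hasn't moved: bump the counter
        return self.value()

    def observe(self, received):
        """Merge the cluster time carried on an incoming message:
        keep the higher of our time and the received time."""
        if received > self.value():
            self.secs, self.counter = received
        return self.value()

# Causal read: a client remembers the highest cluster time it has seen
# and asks to read at-or-after it; a lagging replica would wait until
# its applied time catches up (simulated here as a plain comparison).
def can_serve_causal_read(replica_time, after_cluster_time):
    return replica_time >= after_cluster_time
```

Because timestamps are totally ordered tuples, taking the max on every message is all that's needed for the clock to be monotonic across the whole cluster, clients included.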
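Similarly, a toy sketch of the cross-shard commit path as I understand it. The shard class and the rule of committing everywhere at the maximum prepare timestamp are my assumptions for illustration, not MongoDB internals:

```python
class FakeShard:
    """In-memory stand-in for a shard (a replica set); illustrative only."""

    def __init__(self, cluster_time):
        self.cluster_time = cluster_time  # (unix_secs, counter)
        self.committed_at = None

    def prepare(self, txn):
        # Real MongoDB would persist the prepared state via a majority
        # commit, then answer with a prepare timestamp from its cluster time.
        return self.cluster_time

    def commit(self, txn, commit_ts):
        self.committed_at = commit_ts

def two_phase_commit(shards, txn):
    # Phase 1: prepare on every participant.
    prepare_timestamps = [s.prepare(txn) for s in shards]
    # The coordinator's decision is itself made durable via a majority
    # commit of its own replica set (elided in this sketch).
    # Assumed rule: global commit time = max of the prepare times, so all
    # shards expose the transaction at the same cluster time.
    commit_ts = max(prepare_timestamps)
    # Phase 2: commit everywhere at that cluster time.
    for s in shards:
        s.commit(txn, commit_ts)
    return commit_ts
```

Counting the majority commits in this flow (one per shard prepare, one for the coordinator's decision, one per shard commit) is what gives the "4 majority commits" figure above for a two-shard transaction's critical path.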
A replica will block if needed to satisfy such a causal read. The Lamport clock is incremented by leaders to ensure progress in the absence of write transactions.

The storage engine provides MVCC semantics. Extending this to the replication system is straightforward, since replicas apply transactions serially in the same order. For cross-shard transactions it's the job of the transaction coordinator to commit the transaction with the same cluster time on all shards. If I remember correctly, in the 2PC commit phase it will simply choose the highest of the timestamps returned by the shards as the global transaction timestamp. Combined, MongoDB transactions are snapshot isolation + causal consistency.

Performance:

2PC is used only if a transaction actually has multiple participating shards. It is possible, though not fun or realistic, to specify partition boundaries so that related records from two collections will always reside on the same shard. The 2PC protocol actually requires 4 majority commits, although as of MongoDB 5.0 the client only waits for 3. Majority commit is exactly what QUORUM is in Cassandra, so in a multi-DC cluster, commit waits for cross-DC replication latency.

Notably, single-shard transactions parallelize well, because conflicting transactions can execute on the leader even while the majority commit isn't yet finished. (This involves some speculative execution optimization.) I don't believe the same is true for cross-shard transactions using 2PC.

The paper by Asya Kamsky uses a single replica set and reports 60-70k TPM for a non-standard TPC-C, where varying the number of client threads was allowed and the schema was modified to take advantage of denormalization in a document model. I'm not aware of benchmarks for cross-shard transactions, nor would I expect such results to be great. The philosophy there has been that cross-shard transactions are expected to be a minority of the workload.

Functionality and limitations:

MongoDB's approach has been similar in spirit to what we can observe in the RDBMS market.
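Since snapshot isolation keeps coming up here, it's worth recalling the one anomaly that separates it from serializability: write skew. A toy, self-contained illustration (nothing MongoDB-specific):

```python
# Write skew: two transactions each read a stale snapshot, each writes a
# DIFFERENT key, so snapshot isolation's first-committer-wins check on
# write-write conflicts never fires -- yet the combined result violates
# an invariant ("at least one doctor is on call") that each transaction
# checked against its own snapshot.

def write_skew_demo():
    db = {"alice_on_call": True, "bob_on_call": True}

    # Both transactions take their snapshot before either commits.
    snap1 = dict(db)
    snap2 = dict(db)

    # Txn 1: Alice may go off call if someone else is still on call.
    if snap1["bob_on_call"]:
        db["alice_on_call"] = False   # commits fine: disjoint write set

    # Txn 2: same check, against its own (equally stale) snapshot.
    if snap2["alice_on_call"]:
        db["bob_on_call"] = False     # also commits: no write conflict

    return db

# Under serializable isolation one of the two transactions would abort,
# or be ordered after the other and see its update; under SI both commit
# and nobody is left on call.
```

Whether that anomaly matters in practice is exactly the question the next paragraph is about.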
While MySQL (since forever) and PostgreSQL (since 2011) provide serializable isolation, it is not the default, and it's hard to find a single user who ever wanted to use it. Snapshot Isolation and Causal Consistency are considered the optimal tradeoff between good consistency on one hand, and performance with minimal hassle from aborted transactions on the other. The typical MongoDB user is, like the typical MySQL and PostgreSQL user, happy with this. It is possible to emulate SELECT FOR UPDATE by using findAndModify, which turns your reads into writes and therefore takes a write lock on all touched records.

Note that the first versions of MongoDB transactions got a quite bad Jepsen review. This was mostly a function of none of the above guarantees being the default, and the client API being really confusing, so most users - including Kyle Kingsbury and yours truly - would struggle to get all parameters right to actually enjoy the above-mentioned guarantees. This is a sober reminder that this is complex stuff to get right end to end.

Note that MongoDB also supports linearizable writes and reads, but only on a per-record basis. Linearizable is not available for transactions. It should be noted that MongoDB's approach allows for interactive transactions.

Application to Cassandra:

Replication being leader-based is a poor fit for the expectations of a typical Cassandra user. It's hard to predict whether a typical Cassandra workload can expect cross-partition transactions to be the exceptional case, but my instinct says no. The Lamport clock and the causal consistency it provides are simple to understand and could be a building block in a transactional Cassandra cluster. My personal opinion is that a "synchronized timestamp" (or Hybrid Logical Clock, I guess?) scheme like in Accord is more familiar to current Cassandra semantics.

henrik

--
Henrik Ingo
+358 40 569 7354