Essentially this, although I think in practice we will need to track each 
partition’s timestamp separately (or optionally for reduced conflicts, each row 
or datum’s), and make them all part of the conditional application of the 
transaction - at least for strict-serializability.

The alternative is to insert read/write intents for the transaction during each 
step, and to confirm they are still valid on commit, but this approach would 
require a WAN round-trip for each step in the interactive transaction, whereas 
the timestamp-validating approach can use a LAN round-trip for each step 
besides the final one, and is also much simpler to implement.
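
To make the timestamp-validating approach concrete, here is a minimal single-process sketch in Python. It is an illustration only, not Accord's implementation: the names Store, InteractiveTxn and commit_if_unchanged are hypothetical, a dict stands in for the cluster, and only partitions that were read are validated (an assumption made to keep the sketch short). Each step records the write timestamp of the partition it read, writes are buffered on the coordinator, and the commit applies only if none of those partitions has changed since.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Toy partition: a value plus the timestamp of its last write."""
    value: object = None
    last_write_ts: int = 0


class Store:
    """Hypothetical stand-in for the replicated store (assumption)."""

    def __init__(self):
        self.partitions = {}
        self.clock = 0  # logical clock used to stamp committed writes

    def read(self, key):
        p = self.partitions.setdefault(key, Partition())
        return p.value, p.last_write_ts

    def commit_if_unchanged(self, observed, writes):
        """Apply writes only if every observed partition is unchanged."""
        for key, seen_ts in observed.items():
            current = self.partitions.setdefault(key, Partition())
            if current.last_write_ts != seen_ts:
                return False  # a more recent write intervened: abort
        self.clock += 1
        for key, value in writes.items():
            p = self.partitions.setdefault(key, Partition())
            p.value = value
            p.last_write_ts = self.clock
        return True


class InteractiveTxn:
    """Buffers state on the coordinator; validates timestamps on commit."""

    def __init__(self, store):
        self.store = store
        self.observed = {}  # key -> timestamp seen at first read
        self.writes = {}    # key -> buffered value

    def read(self, key):
        if key in self.writes:  # read-your-own-writes
            return self.writes[key]
        value, ts = self.store.read(key)
        self.observed.setdefault(key, ts)  # keep the first timestamp seen
        return value

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        return self.store.commit_if_unchanged(self.observed, self.writes)


store = Store()
store.commit_if_unchanged({}, {"a": 10, "b": 5})  # seed some data

t1 = InteractiveTxn(store)
t2 = InteractiveTxn(store)
t1.write("b", t1.read("a") + 1)  # t1 reads "a", buffers a write to "b"
t2.write("a", 99)                # t2 blindly overwrites "a"

ok2 = t2.commit()  # succeeds: t2 observed nothing that changed
ok1 = t1.commit()  # fails: "a" changed after t1 read it
```

In this sketch only commit() would need the wide-area round-trip; read() and write() stand in for the per-step operations. Tracking timestamps per row instead of per partition would narrow the set of conflicting writes that force an abort.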


From: Blake Eggleston <beggles...@apple.com.INVALID>
Date: Thursday, 30 September 2021 at 05:47
To: dev@cassandra.apache.org <dev@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
You could establish a lower timestamp bound and buffer transaction state on the 
coordinator, then make the commit an operation that only applies if all 
partitions involved haven’t been changed by a more recent timestamp. You could 
also implement mvcc either in the storage layer or for some period of time by 
buffering commits on each replica before applying.

> On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
> How are interactive transactions possible with Accord?
>
>
>
> On Tue, Sep 21, 2021 at 11:56 PM bened...@apache.org <bened...@apache.org>
> wrote:
>
>> Could you explain why you believe this trade-off is necessary? We can
>> support full SQL just fine with Accord, and I hope that we eventually do so.
>>
>> This domain is incredibly complex, so it is easy to reach wrong
>> conclusions. I would invite you again to propose a system for discussion
>> that you think offers something Accord is unable to, and that you consider
>> desirable, and we can work from there.
>>
>> To pre-empt some possible discussions, I am not aware of anything we
>> cannot do with Accord that we could do with either Calvin or Spanner.
>> Interactive transactions are possible on top of Accord, as are transactions
>> with an unknown read/write set. In each case the only cost is that they
>> would use optimistic concurrency control, which is no worse than Spanner
>> derivatives anyway (which I have to assume is your benchmark in this
>> regard). I do not expect to deliver either functionality initially, but
>> Accord takes us most of the way there for both.
>>
>>
>> From: Jonathan Ellis <jbel...@gmail.com>
>> Date: Wednesday, 22 September 2021 at 05:36
>> To: dev <dev@cassandra.apache.org>
>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>> Right, I'm looking for exactly a discussion on the high level goals.
>> Instead of saying "here's the goals and we ruled out X because Y" we should
>> start with a discussion around, "Approach A allows X and W, approach B
>> allows Y and Z" and decide together what the goals should be and what
>> we are willing to trade to get those goals, e.g., are we willing to give up
>> global strict serializability to get the ability to support full SQL.  Both
>> of these are nice to have!
>>
>> On Tue, Sep 21, 2021 at 9:52 PM bened...@apache.org <bened...@apache.org>
>> wrote:
>>
>>> Hi Jonathan,
>>>
>>> These other systems are incompatible with the goals of the CEP. I do
>>> discuss them (besides 2PC) in both the whitepaper and the CEP, and will
>>> summarise that discussion below. A true and accurate comparison of these
>>> other systems is essentially intractable, as there are complex subtleties
>>> to each flavour, and those who are interested would be better served by
>>> performing their own research.
>>>
>>> I think it is more productive to focus on what we want to achieve as a
>>> community. If you believe the goals of this CEP are wrong for the project,
>>> let’s focus on that. If you want to compare and contrast specific facets of
>>> alternative systems that you consider to be preferable in some dimension,
>>> let’s do that here or in a Q&A as proposed by Joey.
>>>
>>> The relevant goals are that we:
>>>
>>>
>>>  1.  Guarantee strict serializable isolation on commodity hardware
>>>  2.  Scale to any cluster size
>>>  3.  Achieve optimal latency
>>>
>>> The approach taken by Spanner derivatives is rejected by (1) because they
>>> guarantee only Serializable isolation (they additionally fail (3)). From
>>> watching talks by YugaByte, and inferring from Cockroach’s
>>> panic-cluster-death under clock skew, this is clearly considered by
>>> everyone to be undesirable but necessary to achieve scalability.
>>>
>>> The approach taken by FaunaDB (Calvin) is rejected by (2) because its
>>> sequencing layer requires a global leader process for the cluster, which is
>>> incompatible with Cassandra’s scalability requirements. It additionally
>>> fails (3) for global clients.
>>>
>>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
>>> Spanner clone for its multi-key transaction functionality, not 2PC.
>>>
>>> Systems such as RAMP with even weaker isolation are not considered for the
>>> simple reason that they do not even claim to meet (1).
>>>
>>> If we want to additionally offer weaker isolation levels than
>>> Serializable, such as that provided by the recent RAMP-TAO paper, Cassandra
>>> is likely able to support multiple distinct transaction layers that operate
>>> independently. I would encourage you to file a CEP to explore how we can
>>> meet these distinct use cases, but I consider them to be niche. I expect
>>> that a majority of our user base desire strict serializable isolation, and
>>> certainly no less than serializable isolation, to augment the existing
>>> weaker isolation offered by quorum reads and writes.
>>>
>>> I would tangentially note that we are not an AP database under normal
>>> recommended operation. A minority in any network partition cannot reach
>>> QUORUM, so under recommended usage we are a high-availability leaderless CP
>>> database.
>>>
>>>
>>> From: Jonathan Ellis <jbel...@gmail.com>
>>> Date: Tuesday, 21 September 2021 at 23:45
>>> To: dev <dev@cassandra.apache.org>
>>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>>> Benedict, thanks for taking the lead in putting this together. Since
>>> Cassandra is the only relevant database today designed around a leaderless
>>> architecture, it's quite likely that we'll be better served with a custom
>>> transaction design instead of trying to retrofit one from CP systems.
>>>
>>> The whitepaper here is a good description of the consensus algorithm itself
>>> as well as its robustness and stability characteristics, and its comparison
>>> with other state-of-the-art consensus algorithms is very useful.  In the
>>> context of Cassandra, where a consensus algorithm is only part of what will
>>> be implemented, I'd like to see a more complete evaluation of the
>>> transactional side of things as well, including performance characteristics
>>> as well as the types of transactions that can be supported and at least a
>>> general idea of what it would look like applied to Cassandra. This will
>>> allow the PMC to make a more informed decision about what tradeoffs are
>>> best for the entire long-term project of first supplementing and ultimately
>>> replacing LWT.
>>>
>>> (Allowing users to mix LWT and AP Cassandra operations against the same
>>> rows was probably a mistake, so in contrast with LWT we’re not looking for
>>> something fast enough for occasional use but rather something within a
>>> reasonable factor of AP operations, appropriate to being the only way to
>>> interact with tables declared as such.)
>>>
>>> Besides Accord, this should cover
>>>
>>> - Calvin and FaunaDB
>>> - A Spanner derivative (no opinion on whether that should be Cockroach or
>>> Yugabyte, I don’t think it’s necessary to cover both)
>>> - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
>>> there is more public information about MongoDB)
>>> - RAMP
>>>
>>> Here’s an example of what I mean:
>>>
>>> =Calvin=
>>>
>>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
>>> transactions, then replicas execute the transactions independently with no
>>> further coordination.  No SPOF.  Transactions are batched by each sequencer
>>> to keep this from becoming a bottleneck.
>>>
>>> Performance: Calvin paper (published 2012) reports linear scaling of TPC-C
>>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
>>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is composed
>>> of four reads and four writes, so this is effectively 2M reads and 2M
>>> writes as we normally measure them in C*.
>>>
>>> Calvin supports mixed read/write transactions, but because the transaction
>>> execution logic requires knowing all partition keys in advance to ensure
>>> that all replicas can reproduce the same results with no coordination,
>>> reads against non-PK predicates must be done ahead of time (transparently,
>>> by the server) to determine the set of keys, and this must be retried if
>>> the set of rows affected is updated before the actual transaction executes.
>>>
>>> Batching and global consensus add latency -- 100ms in the Calvin paper and
>>> apparently about 50ms in FaunaDB.  Glass half full: all transactions
>>> (including multi-partition updates) are equally performant in Calvin since
>>> the coordination is handled up front in the sequencing step.  Glass half
>>> empty: even single-row reads and writes have to pay the full coordination
>>> cost.  Fauna has optimized this away for reads but I am not aware of a
>>> description of how they changed the design to allow this.
>>>
>>> Functionality and limitations: since the entire transaction must be known
>>> in advance to allow coordination-less execution at the replicas, Calvin
>>> cannot support interactive transactions at all.  FaunaDB mitigates this by
>>> allowing server-side logic to be included, but a Calvin approach will never
>>> be able to offer SQL compatibility.
>>>
>>> Guarantees: Calvin transactions are strictly serializable.  There is no
>>> additional complexity or performance hit to generalizing to multiple
>>> regions, apart from the speed of light.  And since Calvin is already paying
>>> a batching latency penalty, this is less painful than for other systems.
>>>
>>> Application to Cassandra: B-.  Distributed transactions are handled by the
>>> sequencing and scheduling layers, which are leaderless, and Calvin’s
>>> requirements for the storage layer are easily met by C*.  But Calvin also
>>> requires a global consensus protocol and LWT is almost certainly not
>>> sufficiently performant, so this would require ZK or etcd (reasonable for a
>>> library approach but not for replacing LWT in C* itself), or an
>>> implementation of Accord.  I don’t believe Calvin would require additional
>>> table-level metadata in Cassandra.
>>>
>>> On Sun, Sep 5, 2021 at 9:33 AM bened...@apache.org <bened...@apache.org>
>>> wrote:
>>>
>>>> Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
>>>> Whitepaper: https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
>>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
>>>> Prototype: https://github.com/belliottsmith/accord
>>>>
>>>> Hi everyone, I’d like to propose this CEP for adoption by the community.
>>>>
>>>> Cassandra has benefitted from LWTs for many years, but application
>>>> developers that want to ensure consistency for complex operations must
>>>> either accept the scalability bottleneck of serializing all related state
>>>> through a single partition, or layer a complex state machine on top of the
>>>> database. These are sophisticated and costly activities that our users
>>>> should not be expected to undertake. Since distributed databases are
>>>> beginning to offer distributed transactions with fewer caveats, it is past
>>>> time for Cassandra to do so as well.
>>>>
>>>> This CEP proposes the use of several novel techniques that build upon
>>>> research (that followed EPaxos) to deliver (non-interactive) general
>>>> purpose distributed transactions. The approach is outlined in the wikipage
>>>> and in more detail in the linked whitepaper. Importantly, by adopting this
>>>> approach we will be the _only_ distributed database to offer global,
>>>> scalable, strict serializable transactions in one wide area round-trip.
>>>> This would represent a significant improvement in the state of the art,
>>>> both in the academic literature and in commercial or open source offerings.
>>>>
>>>> This work has been partially realised in a prototype. This partial
>>>> prototype has been verified against Jepsen.io’s Maelstrom library and
>>>> dedicated in-tree strict serializability verification tools, but much work
>>>> remains for the work to be production capable and integrated into Cassandra.
>>>>
>>>> I propose including the prototype in the project as a new source
>>>> repository, to be developed as a standalone library for integration into
>>>> Cassandra. I hope the community sees the important value proposition of
>>>> this proposal, and will adopt the CEP after this discussion, so that the
>>>> library and its integration into Cassandra can be developed in parallel and
>>>> with the involvement of the wider community.
>>>>
>>>
>>>
>>> --
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com
>>> @spyced
>>>
>>
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
>>
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced

