I disagree with you. However, this is the wrong forum for a meta discussion
about how CEPs should be structured.

If you want to impose your views on CEP structure on others, please file a CEP 
with the additional restrictions and guidance you want to impose and start a 
discussion thread. I can then respond in detail to why I perceive this approach 
to be flawed, in a dedicated context.


From: Paulo Motta <pauloricard...@gmail.com>
Date: Friday, 1 October 2021 at 14:48
To: Cassandra DEV <dev@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> The proposal as it stands today is exceptionally thorough, more so than
> any other CEP to date, or any CEP is likely to be in the near future.

The protocol is thoroughly described, but in my view a CEP is a forum to
discuss the high-level architecture and plan for adding a full end-to-end
enhancement to the database, breaking it into sub-CEPs if needed, as long
as the full plan is known in advance. Otherwise the community will not have
the context to judge the full extent and impact of the proposed enhancement.

> Since it remains unclear to me what either yourself or Jonathan want to
> see as an alternative

I would personally like to see something along these lines:

CEP1: Add ACID-compliant atomic batches
- UX changes needed: none, CQL provides the grammar we need.
- Distributed transaction protocol needed: Accord (link to white paper if
you want specific details about the protocol)
- High-level architecture: what new components will be added, how existing
components will be modified, what new messages will be added, what new
configuration knobs will be introduced, what are the milestones of the
project, etc.

CEP2: Make LWT faster and more reliable
- UX changes needed: none
- Distributed transaction protocol needed: Accord, already added by
previous CEP.
- High-level architecture: blablabla... and so on.

On Fri, 1 Oct 2021 at 10:19, bened...@apache.org <bened...@apache.org> wrote:

> I think this is getting circular and unproductive. Basic disagreements
> about whether the CEP specifies a feature I am inclined to leave for a
> vote. In my view the CEP specifies several features, both immediate ones
> for the user (ACID batches and multi-key LWTs) and developer-focused ones
> around ground-breaking semantics that will be enabled.
>
> The proposal as it stands today is exceptionally thorough, more so than
> any other CEP to date, or any CEP is likely to be in the near future.
>
> This is a Cassandra Enhancement *Proposal*, and at some point we have to
> engage with what is proposed, not what you might like to be proposed. Since
> it remains unclear to me what either yourself or Jonathan want to see as an
> alternative, at this point it would seem more productive to produce your
> own proposals for the community to consider. It is possible for multiple
> transaction systems to co-exist, if you feel this is necessary.
>
>
>
> From: Paulo Motta <pauloricard...@gmail.com>
> Date: Friday, 1 October 2021 at 13:58
> To: Cassandra DEV <dev@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> I share similar feelings as jbellis that this proposal seems to be focusing
> on the protocol itself but lacking the actual feature that will use the
> protocol, which IMO is a key element to discuss in a CEP.
>
> It's similar to saying: hey I want to add this Tries Serialization Protocol
> to Cassandra, but not providing specific details of how this protocol is
> going to be used.
>
> I think the right route for a CEP is to describe the feature that will be
> added to the database and the protocol is a mere requirement of the
> high-level feature, for example:
>
> CEP: Add Trie-backed memtable
> - Trie Serialization Protocol: implementation detail of the above CEP
>
> What is the difficulty of taking this approach, picking one of the myriad
> of features that will be enabled by Accord and using that as the initial
> CEP to introduce the protocol to the database?
>
> On Fri, 1 Oct 2021 at 08:37, bened...@apache.org <bened...@apache.org> wrote:
>
> > Actually, thinking about it again, the simple optimistic protocol would in
> > fact guarantee system forward progress (i.e. independent of transaction
> > formulation).
> >
> >
> > From: bened...@apache.org <bened...@apache.org>
> > Date: Friday, 1 October 2021 at 09:14
> > To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > Hi Jonathan,
> >
> > It would be great if we could achieve a bandwidth higher than 1-2 short
> > emails per week. It remains unclear to me what your goal is, and it would
> > help if you could make a statement like “I want Cassandra to be able to do
> > X” so that we can respond directly to it. I am also available to have
> > another call, in which we can have a back and forth, please feel free to
> > propose a London-compatible time within the next week that is suitable for
> > you.
> >
> > In my opinion we are at risk of veering off-topic, though. This CEP is not
> > to deliver interactive transactions, and to my knowledge nobody is
> > proposing a CEP for interactive transactions. So, for the CEP at hand the
> > salient question seems: does this CEP prevent us from implementing
> > interactive transactions with properties X, Y, Z in future? To which the
> > answer is almost certainly no.
> >
> > However, to continue the discussion and respond directly to your queries,
> > I believe we agree on the definition of an interactive transaction.
> >
> > Two protocols were loosely outlined. The first, using timestamps for
> > optimistic concurrency control, would indeed involve the possibility of
> > aborts. It would not however inherently adopt the issue of LWTs where no
> > transaction is able to make progress. Whether or not progress is guaranteed
> > (in a livelock-free sense) would depend on the structure of the
> > transactions that were interfering.
> >
> > This approach has the advantage of being very simple to implement, so that
> > we could realistically support interactive transactions quite quickly. It
> > has the additional advantage that transactions would execute very quickly
> > by avoiding the WAN during construction, and as a result may in practice
> > experience fewer aborts than protocols that guarantee livelock-freedom.
> >
> > The second protocol proposed using read/write intents and would be able to
> > support almost any behaviour you want. We could even utilise pessimistic
> > concurrency control, or anything in-between. This is its own huge design
> > space, and discussion of this approach and the trade-offs that could be
> > made is (in my opinion) entirely out of scope for this CEP.
> >
> >
> > From: Jonathan Ellis <jbel...@gmail.com>
> > Date: Friday, 1 October 2021 at 05:00
> > To: dev <dev@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > The obstacle for me is you've provided a protocol but not a fully fleshed
> > out architecture, so it's hard to fill in some of the blanks.  But it looks
> > to me like optimistic concurrency control for interactive transactions
> > applied to Accord would leave you in a LWT-like situation under fairly
> > light contention where nobody actually makes progress due to retries.
> >
> > To make sure we're talking about the same thing, as Henrik pointed out,
> > interactive transactions mean multiple round trips from the client within a
> > transaction.  For example, here
> > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > is a simple implementation of the TPC-C New Order transaction.  The high
> > level logic (via
> > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > is,
> >
> >    1. Get records describing a warehouse, customer, & district
> >    2. Update the district
> >    3. Increment next available order number
> >    4. Insert record into Order and New-Order tables
> >    5. For 5-15 items, get Item record, get/update Stock record
> >    6. Insert Order-Line Record
> >
> > As you can see, this requires a lot of client-side logic mixed in with the
> > actual SQL commands.
> >
> >
> > On Thu, Sep 30, 2021 at 2:30 AM bened...@apache.org <bened...@apache.org>
> > wrote:
> >
> > > Essentially this, although I think in practice we will need to track each
> > > partition’s timestamp separately (or optionally for reduced conflicts, each
> > > row or datum’s), and make them all part of the conditional application of
> > > the transaction - at least for strict-serializability.
> > >
> > > The alternative is to insert read/write intents for the transaction during
> > > each step, and to confirm they are still valid on commit, but this approach
> > > would require a WAN round-trip for each step in the interactive
> > > transaction, whereas the timestamp-validating approach can use a LAN
> > > round-trip for each step besides the final one, and is also much simpler to
> > > implement.
> > >
> > >
> > > From: Blake Eggleston <beggles...@apple.com.INVALID>
> > > Date: Thursday, 30 September 2021 at 05:47
> > > To: dev@cassandra.apache.org <dev@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > You could establish a lower timestamp bound and buffer transaction state
> > > on the coordinator, then make the commit an operation that only applies if
> > > all partitions involved haven’t been changed by a more recent timestamp.
> > > You could also implement mvcc either in the storage layer or for some
> > > period of time by buffering commits on each replica before applying.
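
A minimal sketch of the timestamp-validated commit described above, in Python. The `Store` and `InteractiveTxn` classes and all of their method names are hypothetical and are not part of Accord or Cassandra; a single in-memory dict stands in for the replicas, and one lower bound is used per transaction rather than one timestamp per partition.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Toy stand-in for the replicas: partition key -> (value, write timestamp)."""
    data: dict = field(default_factory=dict)
    clock: int = 0  # last timestamp handed out

    def next_timestamp(self) -> int:
        self.clock += 1
        return self.clock


class InteractiveTxn:
    """Buffers transaction state on the coordinator until commit."""

    def __init__(self, store: Store):
        self.store = store
        self.lower_bound = store.clock  # lower timestamp bound for this transaction
        self.touched = set()            # partitions read or written so far
        self.writes = {}                # buffered writes, invisible to other transactions

    def read(self, key):
        self.touched.add(key)
        if key in self.writes:          # read-your-own-writes from the buffer
            return self.writes[key]
        value, _ = self.store.data.get(key, (None, 0))
        return value

    def write(self, key, value):
        self.touched.add(key)
        self.writes[key] = value

    def commit(self) -> bool:
        # Abort if any involved partition was changed after our lower bound.
        for key in self.touched:
            _, written_at = self.store.data.get(key, (None, 0))
            if written_at > self.lower_bound:
                return False
        # Otherwise apply all buffered writes at a single commit timestamp.
        commit_ts = self.store.next_timestamp()
        for key, value in self.writes.items():
            self.store.data[key] = (value, commit_ts)
        return True
```

If two such transactions both read and increment the same counter, the first to commit succeeds and the second returns `False` from `commit()`, so the client has to retry it. That is the abort behaviour that optimistic concurrency control implies.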
> > >
> > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> > > >
> > > > How are interactive transactions possible with Accord?
> > > >
> > > >
> > > >
> > > > On Tue, Sep 21, 2021 at 11:56 PM bened...@apache.org <bened...@apache.org>
> > > > wrote:
> > > >
> > > >> Could you explain why you believe this trade-off is necessary? We can
> > > >> support full SQL just fine with Accord, and I hope that we eventually
> > > >> do so.
> > > >>
> > > >> This domain is incredibly complex, so it is easy to reach wrong
> > > >> conclusions. I would invite you again to propose a system for discussion
> > > >> that you think offers something Accord is unable to, and that you consider
> > > >> desirable, and we can work from there.
> > > >>
> > > >> To pre-empt some possible discussions, I am not aware of anything we
> > > >> cannot do with Accord that we could do with either Calvin or Spanner.
> > > >> Interactive transactions are possible on top of Accord, as are transactions
> > > >> with an unknown read/write set. In each case the only cost is that they
> > > >> would use optimistic concurrency control, which is no worse than the Spanner
> > > >> derivatives anyway (which I have to assume is your benchmark in this
> > > >> regard). I do not expect to deliver either functionality initially, but
> > > >> Accord takes us most of the way there for both.
> > > >>
> > > >>
> > > >> From: Jonathan Ellis <jbel...@gmail.com>
> > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > >> To: dev <dev@cassandra.apache.org>
> > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > >> Right, I'm looking for exactly a discussion on the high level goals.
> > > >> Instead of saying "here's the goals and we ruled out X because Y" we should
> > > >> start with a discussion around, "Approach A allows X and W, approach B
> > > >> allows Y and Z" and decide together what the goals should be and what
> > > >> we are willing to trade to get those goals, e.g., are we willing to give up
> > > >> global strict serializability to get the ability to support full SQL. Both
> > > >> of these are nice to have!
> > > >>
> > > >> On Tue, Sep 21, 2021 at 9:52 PM bened...@apache.org <bened...@apache.org>
> > > >> wrote:
> > > >>
> > > >>> Hi Jonathan,
> > > >>>
> > > >>> These other systems are incompatible with the goals of the CEP. I do
> > > >>> discuss them (besides 2PC) in both the whitepaper and the CEP, and will
> > > >>> summarise that discussion below. A true and accurate comparison of these
> > > >>> other systems is essentially intractable, as there are complex subtleties
> > > >>> to each flavour, and those who are interested would be better served by
> > > >>> performing their own research.
> > > >>>
> > > >>> I think it is more productive to focus on what we want to achieve as a
> > > >>> community. If you believe the goals of this CEP are wrong for the project,
> > > >>> let’s focus on that. If you want to compare and contrast specific facets of
> > > >>> alternative systems that you consider to be preferable in some dimension,
> > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > >>>
> > > >>> The relevant goals are that we:
> > > >>>
> > > >>>
> > > >>>  1.  Guarantee strict serializable isolation on commodity hardware
> > > >>>  2.  Scale to any cluster size
> > > >>>  3.  Achieve optimal latency
> > > >>>
> > > >>> The approach taken by Spanner derivatives is rejected by (1) because they
> > > >>> guarantee only Serializable isolation (they additionally fail (3)). From
> > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > >>> panic-cluster-death under clock skew, this is clearly considered by
> > > >>> everyone to be undesirable but necessary to achieve scalability.
> > > >>>
> > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2) because its
> > > >>> sequencing layer requires a global leader process for the cluster, which is
> > > >>> incompatible with Cassandra’s scalability requirements. It additionally
> > > >>> fails (3) for global clients.
> > > >>>
> > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> > > >>> Spanner clone for its multi-key transaction functionality, not 2PC.
> > > >>>
> > > >>> Systems such as RAMP with even weaker isolation are not considered for the
> > > >>> simple reason that they do not even claim to meet (1).
> > > >>>
> > > >>> If we want to additionally offer weaker isolation levels than
> > > >>> Serializable, such as that provided by the recent RAMP-TAO paper, Cassandra
> > > >>> is likely able to support multiple distinct transaction layers that operate
> > > >>> independently. I would encourage you to file a CEP to explore how we can
> > > >>> meet these distinct use cases, but I consider them to be niche. I expect
> > > >>> that a majority of our user base desire strict serializable isolation, and
> > > >>> certainly no less than serializable isolation, to augment the existing
> > > >>> weaker isolation offered by quorum reads and writes.
> > > >>>
> > > >>> I would tangentially note that we are not an AP database under normal
> > > >>> recommended operation. A minority in any network partition cannot reach
> > > >>> QUORUM, so under recommended usage we are a high-availability leaderless CP
> > > >>> database.
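
The QUORUM arithmetic behind this point can be sketched as follows, assuming the usual definition of a quorum as floor(RF / 2) + 1 replicas. The function names are illustrative only and do not correspond to any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM operation."""
    return replication_factor // 2 + 1


def can_reach_quorum(reachable_replicas: int, replication_factor: int) -> bool:
    """True if one side of a network partition can still serve QUORUM requests."""
    return reachable_replicas >= quorum(replication_factor)


# With RF=3 split 2/1 by a partition, only the majority side stays available.
majority_ok = can_reach_quorum(2, 3)
minority_ok = can_reach_quorum(1, 3)
print(majority_ok, minority_ok)
```

Because two disjoint sides of a partition can never both hold a quorum, at most one side can accept QUORUM reads and writes, which is the CP behaviour described above.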
> > > >>>
> > > >>>
> > > >>> From: Jonathan Ellis <jbel...@gmail.com>
> > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > >>> To: dev <dev@cassandra.apache.org>
> > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > >>> Benedict, thanks for taking the lead in putting this together. Since
> > > >>> Cassandra is the only relevant database today designed around a leaderless
> > > >>> architecture, it's quite likely that we'll be better served with a custom
> > > >>> transaction design instead of trying to retrofit one from CP systems.
> > > >>>
> > > >>> The whitepaper here is a good description of the consensus algorithm itself
> > > >>> as well as its robustness and stability characteristics, and its comparison
> > > >>> with other state-of-the-art consensus algorithms is very useful.  In the
> > > >>> context of Cassandra, where a consensus algorithm is only part of what will
> > > >>> be implemented, I'd like to see a more complete evaluation of the
> > > >>> transactional side of things as well, including performance characteristics
> > > >>> as well as the types of transactions that can be supported and at least a
> > > >>> general idea of what it would look like applied to Cassandra. This will
> > > >>> allow the PMC to make a more informed decision about what tradeoffs are
> > > >>> best for the entire long-term project of first supplementing and ultimately
> > > >>> replacing LWT.
> > > >>>
> > > >>> (Allowing users to mix LWT and AP Cassandra operations against the same
> > > >>> rows was probably a mistake, so in contrast with LWT we’re not looking for
> > > >>> something fast enough for occasional use but rather something within a
> > > >>> reasonable factor of AP operations, appropriate to being the only way to
> > > >>> interact with tables declared as such.)
> > > >>>
> > > >>> Besides Accord, this should cover
> > > >>>
> > > >>> - Calvin and FaunaDB
> > > >>> - A Spanner derivative (no opinion on whether that should be Cockroach or
> > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
> > > >>> there is more public information about MongoDB)
> > > >>> - RAMP
> > > >>>
> > > >>> Here’s an example of what I mean:
> > > >>>
> > > >>> =Calvin=
> > > >>>
> > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
> > > >>> transactions, then replicas execute the transactions independently with no
> > > >>> further coordination.  No SPOF.  Transactions are batched by each sequencer
> > > >>> to keep this from becoming a bottleneck.
> > > >>>
> > > >>> Performance: Calvin paper (published 2012) reports linear scaling of TPC-C
> > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
> > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is composed
> > > >>> of four reads and four writes, so this is effectively 2M reads and 2M
> > > >>> writes as we normally measure them in C*.
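
The conversion from transactions to individual operations is simple arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and the per-machine figure is a plain average that assumes the load is spread evenly.

```python
TXNS_PER_SEC = 500_000   # TPC-C New Order throughput quoted from the Calvin paper
MACHINES = 100           # cluster size in that experiment
READS_PER_TXN = 4        # reads per New Order transaction, as stated above
WRITES_PER_TXN = 4       # writes per New Order transaction, as stated above

reads_per_sec = TXNS_PER_SEC * READS_PER_TXN
writes_per_sec = TXNS_PER_SEC * WRITES_PER_TXN
txns_per_machine = TXNS_PER_SEC // MACHINES

print(reads_per_sec, writes_per_sec, txns_per_machine)
```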
> > > >>>
> > > >>> Calvin supports mixed read/write transactions, but because the transaction
> > > >>> execution logic requires knowing all partition keys in advance to ensure
> > > >>> that all replicas can reproduce the same results with no coordination,
> > > >>> reads against non-PK predicates must be done ahead of time (transparently,
> > > >>> by the server) to determine the set of keys, and this must be retried if
> > > >>> the set of rows affected is updated before the actual transaction executes.
> > > >>>
> > > >>> Batching and global consensus add latency -- 100ms in the Calvin paper and
> > > >>> apparently about 50ms in FaunaDB.  Glass half full: all transactions
> > > >>> (including multi-partition updates) are equally performant in Calvin since
> > > >>> the coordination is handled up front in the sequencing step.  Glass half
> > > >>> empty: even single-row reads and writes have to pay the full coordination
> > > >>> cost.  Fauna has optimized this away for reads but I am not aware of a
> > > >>> description of how they changed the design to allow this.
> > > >>>
> > > >>> Functionality and limitations: since the entire transaction must be known
> > > >>> in advance to allow coordination-less execution at the replicas, Calvin
> > > >>> cannot support interactive transactions at all.  FaunaDB mitigates this by
> > > >>> allowing server-side logic to be included, but a Calvin approach will never
> > > >>> be able to offer SQL compatibility.
> > > >>>
> > > >>> Guarantees: Calvin transactions are strictly serializable.  There is no
> > > >>> additional complexity or performance hit to generalizing to multiple
> > > >>> regions, apart from the speed of light.  And since Calvin is already paying
> > > >>> a batching latency penalty, this is less painful than for other systems.
> > > >>>
> > > >>> Application to Cassandra: B-.  Distributed transactions are handled by the
> > > >>> sequencing and scheduling layers, which are leaderless, and Calvin’s
> > > >>> requirements for the storage layer are easily met by C*.  But Calvin also
> > > >>> requires a global consensus protocol and LWT is almost certainly not
> > > >>> sufficiently performant, so this would require ZK or etcd (reasonable for a
> > > >>> library approach but not for replacing LWT in C* itself), or an
> > > >>> implementation of Accord.  I don’t believe Calvin would require additional
> > > >>> table-level metadata in Cassandra.
> > > >>>
> > > >>> On Sun, Sep 5, 2021 at 9:33 AM bened...@apache.org <bened...@apache.org>
> > > >>> wrote:
> > > >>>
> > > >>>> Wiki:
> > > >>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > >>>> Whitepaper:
> > > >>>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > >>>>
> > > >>>> Hi everyone, I’d like to propose this CEP for adoption by the community.
> > > >>>>
> > > >>>> Cassandra has benefitted from LWTs for many years, but application
> > > >>>> developers that want to ensure consistency for complex operations must
> > > >>>> either accept the scalability bottleneck of serializing all related state
> > > >>>> through a single partition, or layer a complex state machine on top of the
> > > >>>> database. These are sophisticated and costly activities that our users
> > > >>>> should not be expected to undertake. Since distributed databases are
> > > >>>> beginning to offer distributed transactions with fewer caveats, it is past
> > > >>>> time for Cassandra to do so as well.
> > > >>>>
> > > >>>> This CEP proposes the use of several novel techniques that build upon
> > > >>>> research (that followed EPaxos) to deliver (non-interactive) general
> > > >>>> purpose distributed transactions. The approach is outlined in the wikipage
> > > >>>> and in more detail in the linked whitepaper. Importantly, by adopting this
> > > >>>> approach we will be the _only_ distributed database to offer global,
> > > >>>> scalable, strict serializable transactions in one wide area round-trip.
> > > >>>> This would represent a significant improvement in the state of the art,
> > > >>>> both in the academic literature and in commercial or open source offerings.
> > > >>>>
> > > >>>> This work has been partially realised in a prototype. This partial
> > > >>>> prototype has been verified against Jepsen.io’s Maelstrom library and
> > > >>>> dedicated in-tree strict serializability verification tools, but much work
> > > >>>> remains for the work to be production capable and integrated into Cassandra.
> > > >>>>
> > > >>>> I propose including the prototype in the project as a new source
> > > >>>> repository, to be developed as a standalone library for integration into
> > > >>>> Cassandra. I hope the community sees the important value proposition of
> > > >>>> this proposal, and will adopt the CEP after this discussion, so that the
> > > >>>> library and its integration into Cassandra can be developed in parallel and
> > > >>>> with the involvement of the wider community.
> > > >>>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Jonathan Ellis
> > > >>> co-founder, http://www.datastax.com
> > > >>> @spyced
> > > >>>
