Re: [DISCUSS] CEP-13: Denylisting partitions

2021-09-07 Thread Sumanth Pasupuleti
I have resolved the closed comment threads in the design document.
If there is no further feedback, I will start a voting thread tomorrow.

Thanks,
Sumanth
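
For anyone skimming the thread, the reads|writes|both per-partition choice discussed in the quoted messages below could be modelled roughly like this. It is only a sketch in Python, not the design document's interface: `DenyMode`, `PartitionDenylist` and all method names are hypothetical, and the default of denying both reads and writes simply follows the suggestion made in the thread.

```python
from enum import Flag, auto


class DenyMode(Flag):
    """Which operations are refused for a denylisted partition (hypothetical)."""
    READS = auto()
    WRITES = auto()
    BOTH = READS | WRITES


class PartitionDenylist:
    """In-memory stand-in for a denylist store; not Cassandra's implementation."""

    def __init__(self, default_mode=DenyMode.BOTH):
        self._default_mode = default_mode
        # (keyspace, table, partition_key) -> DenyMode
        self._entries = {}

    def deny(self, keyspace, table, key, mode=None):
        # Fall back to the default (deny both) when no mode is given.
        self._entries[(keyspace, table, key)] = mode or self._default_mode

    def allow(self, keyspace, table, key):
        self._entries.pop((keyspace, table, key), None)

    def is_denied(self, keyspace, table, key, operation):
        mode = self._entries.get((keyspace, table, key))
        return mode is not None and bool(mode & operation)


denylist = PartitionDenylist()
denylist.deny("ks", "events", "hot-key")                     # default: both
denylist.deny("ks", "events", "fixable-key", DenyMode.READS)  # writes still allowed
```

With this shape, the simple case stays a one-argument call, and an operator only names a mode when a partition needs the intervention window described below (reads denied, writes or deletes still accepted).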

On Thu, Sep 2, 2021 at 6:54 AM Joshua McKenzie wrote:

> I'm +1 on where it currently stands after the revisions. Consider resolving
> the closed comment threads on the design doc so we can see, at a high
> level, whether there are any outstanding discussions?
>
> ~Josh
>
> On Mon, Aug 30, 2021 at 1:14 AM Sumanth Pasupuleti <
> sumanth.pasupuleti...@gmail.com> wrote:
>
> > +1. Made changes to the design document linked against the CEP to reflect
> > this feedback. Specifically, the following sections have been updated:
> > * Operations to blacklist
> > * Blacklist information store
> >
> > Thanks,
> > Sumanth
> >
> >
> > On Fri, Aug 27, 2021 at 7:57 AM Joshua McKenzie 
> > wrote:
> >
> > > I can see the case for all three:
> > > * Deny both reads and writes to a partition (wide, heavily tombstoned,
> > > too many sstables, etc.) causing disruption to a replica set; don't want
> > > further growth nor reads until operator intervention
> > > * Deny reads but allow writing to rectify problems on a partition
> > > (intervention window; see above)
> > > * Deny writes to a partition but allow reads (prevent partitions growing
> > > unbounded, or potentially evolving into a future feature creating a
> > > ceiling on partition sizes that kicks in and demands application
> > > intervention to reduce partition size at a guardrail limit)
> > >
> > > So yeah, at least to me at face value it seems like it'd be worth it not
> > > only to allow denylisting both reads and writes, but to be able to choose
> > > from the set of reads|writes|both on a per-partition basis.
> > >
> > > ~Josh
> > >
> > >
> > > On Thu, Aug 26, 2021 at 2:16 PM Sumanth Pasupuleti <
> > > sumanth.pasupuleti...@gmail.com> wrote:
> > >
> > > > Thank you, Josh, for the elaborate explanation of a potential scenario
> > > > where denylisting writes would make sense.
> > > > I 100% agree that it could help in a situation where we would want to
> > > > deny writes to a partition that we do not have much control over (which
> > > > is true in most situations) and such behavior can eventually lead to
> > > > unavailability of other partitions too, as you indicate.
> > > >
> > > > Do you think it makes sense to make it configurable per partition
> > > > though? As in, maybe by default, we would want to deny both reads and
> > > > writes to a partition, but for certain partitions, we may still want to
> > > > allow writes just so we can issue a delete against that partition, as an
> > > > example.
> > > > Of course this would make the feature and the interface heavier, and we
> > > > need to think through if it's worth it. I personally feel it could be
> > > > worth it, especially if we agree on the default behavior that makes the
> > > > interface simple in most cases. Thoughts?
> > > >
> > > > And yes, so good to see the CEP process reaping benefits in multiple
> > > > ways - especially around collaboration and documentation.
> > > >
> > > >
> > > > On Thu, Aug 26, 2021 at 8:31 AM Joshua McKenzie <jmcken...@apache.org>
> > > > wrote:
> > > >
> > > > > The design doc and CEP currently pass on blocklisting / denylisting
> > > > > writes at this time. In the proposed new patch it states:
> > > > > "Note: We do not want to blacklist writes since it is the reads that
> > > > > primarily impact the performance when reading a bad partition, and we
> > > > > may want writes to be allowed to “fix” a bad partition. We could
> > > > > revisit this in the future"
> > > > >
> > > > > In situations where you have an air gap between database ops and
> > > > > application access (ops <> application teams, or more autonomous
> > > > > application access patterns, self-service, etc), you can easily get
> > > > > into a situation where you have either a pathological client hammering
> > > > > writes to a specific partition causing impact to other clients or in
> > > > > the worst case, the replica set, or unbounded partition growth that
> > > > > again leads to performance degradation or replica set unavailability.
> > > > > The tradeoff there becomes "do we interrupt the application's ability
> > > > > to write to this partition now, or do we instead defer and risk losing
> > > > > access to *all* partitions on this replica set and still interrupt
> > > > > their access eventually anyway?"
> > > > >
> > > > > Given this, I strongly advocate for support of denylisting both reads
> > > > > *and* writes on these grounds; operators need another tool in their
> > > > > toolbox to deal with situations where specific partition writing has
> > > > > wider negative impacts on the replicas.
> > > > >
> > > > > Acknowledging of course that there was extensive 

Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-07 Thread bened...@apache.org
Hi Jake,

> What structural changes are planned to support an external dependency project 
> like this

To add to Blake’s answer, in case there’s some confusion over this, the 
proposal is to include this library within the Apache Cassandra project. So I 
wouldn’t think of it as an external dependency. This PMC and community will 
still have the usual oversight over direction and development, and APIs will be 
developed solely with the intention of their integration with Cassandra.

> Will this effort eventually replace consistency levels in C*?

I hope we’ll have some very related discussions around consistency levels in 
the coming months more generally, but I don’t think that is tightly coupled to 
this work. I agree with you both that we won’t want to perpetuate the problems 
you’ve highlighted though.

Henrik:
> I was referring to the property that Calvin transactions also need to be sent 
> to the cluster in a single shot

Ah, yes. In that case I agree, and I tried to point to this direction in an 
earlier email, where I discussed the use of scripting languages (i.e. 
transactionally modifying the database with some subset of arbitrary 
computation). I think the JVM is particularly suited to offering quite powerful 
distributed transactions in this vein, and it will be interesting to see what 
we might develop in this direction in future.


From: Jake Luciani 
Date: Tuesday, 7 September 2021 at 19:27
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Great, thanks for the information.

On Tue, Sep 7, 2021 at 12:44 PM Blake Eggleston
 wrote:

> Hi Jake,
>
> > 1.  Will this effort eventually replace consistency levels in C*?  I ask
> > because one of the shortcomings of our paxos today is
> > it can be easily mixed with non serialized consistencies and therefore
> > users commonly break consistency by for example reading at CL.ONE while
> > also
> > using LWTs.
>
> This will likely require CLs to be specified at the schema level for
> tables using multi partition transactions. I’d expect this to be available
> for other tables, but not required.
>
> > 2. What structural changes are planned to support an external dependency
> > project like this?  Are there some high level interfaces you expect the
> > project to adhere to?
>
> There will be some interfaces that need to be implemented in C* to support
> the library. You can find the current interfaces in the accord.api package,
> but these were written to support some initial testing, and not intended
> for integration into C* as is. Things are pretty fluid right now and will
> be rewritten / refactored multiple times over the next few months.
>
> Thanks,
>
> Blake
>
>
> > On Sun, Sep 5, 2021 at 10:33 AM bened...@apache.org
> > wrote:
> >
> >> Wiki:
> >> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> >> Whitepaper:
> >> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> >> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> >> Prototype: https://github.com/belliottsmith/accord
> >>
> >> Hi everyone, I’d like to propose this CEP for adoption by the community.
> >>
> >> Cassandra has benefitted from LWTs for many years, but application
> >> developers that want to ensure consistency for complex operations must
> >> either accept the scalability bottleneck of serializing all related state
> >> through a single partition, or layer a complex state machine on top of the
> >> database. These are sophisticated and costly activities that our users
> >> should not be expected to undertake. Since distributed databases are
> >> beginning to offer distributed transactions with fewer caveats, it is past
> >> time for Cassandra to do so as well.
> >>
> >> This CEP proposes the use of several novel techniques that build upon
> >> research (that followed EPaxos) to deliver (non-interactive) general
> >> purpose distributed transactions. The approach is outlined in the wikipage
> >> and in more detail in the linked whitepaper. Importantly, by adopting this
> >> approach we will be the _only_ distributed database to offer global,
> >> scalable, strict serializable transactions in one wide area round-trip.
> >> This would represent a significant improvement in the state of the art,
> >> both in the academic literature and in commercial or open source offerings.
> >>
> >> This work has been partially realised in a prototype. This partial
> >> prototype has been verified against Jepsen.io’s Maelstrom library and
> >> dedicated in-tree strict serializability verification tools, but much work
> >> remains for the work to be production capable and integrated into Cassandra.
> >>
> >> I propose including the prototype in the project as a new source
> >> repository, to be developed as a standalone library for integration into
> >> Cassandra. I hope the com

[RELEASE] Apache Cassandra 4.0.1 released

2021-09-07 Thread Sam Tunnicliffe


The Cassandra team is pleased to announce the release of Apache Cassandra 
version 4.0.1.

Apache Cassandra is a fully distributed database. It is the right choice when 
you need scalability and high availability without compromising performance.

 http://cassandra.apache.org/

Downloads of source and binary distributions are listed in our download section:

 http://cassandra.apache.org/download/

This version is a bug fix release[1] on the 4.0 series. As always, please pay 
attention to the release notes[2] and let us know[3] if you were to encounter 
any problem.

Enjoy!

[1]: CHANGES.txt 
https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/cassandra-4.0.1
[2]: NEWS.txt 
https://gitbox.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/cassandra-4.0.1
[3]: https://issues.apache.org/jira/browse/CASSANDRA





Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-07 Thread Jake Luciani
Great, thanks for the information.

On Tue, Sep 7, 2021 at 12:44 PM Blake Eggleston
 wrote:

> Hi Jake,
>
> > 1.  Will this effort eventually replace consistency levels in C*?  I ask
> > because one of the shortcomings of our paxos today is
> > it can be easily mixed with non serialized consistencies and therefore
> > users commonly break consistency by for example reading at CL.ONE while
> > also
> > using LWTs.
>
> This will likely require CLs to be specified at the schema level for
> tables using multi partition transactions. I’d expect this to be available
> for other tables, but not required.
>
> > 2. What structural changes are planned to support an external dependency
> > project like this?  Are there some high level interfaces you expect the
> > project to adhere to?
>
> There will be some interfaces that need to be implemented in C* to support
> the library. You can find the current interfaces in the accord.api package,
> but these were written to support some initial testing, and not intended
> for integration into C* as is. Things are pretty fluid right now and will
> be rewritten / refactored multiple times over the next few months.
>
> Thanks,
>
> Blake
>
>
> > On Sun, Sep 5, 2021 at 10:33 AM bened...@apache.org
> > wrote:
> >
> >> Wiki:
> >> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> >> Whitepaper:
> >> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> >> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> >> Prototype: https://github.com/belliottsmith/accord
> >>
> >> Hi everyone, I’d like to propose this CEP for adoption by the community.
> >>
> >> Cassandra has benefitted from LWTs for many years, but application
> >> developers that want to ensure consistency for complex operations must
> >> either accept the scalability bottleneck of serializing all related state
> >> through a single partition, or layer a complex state machine on top of the
> >> database. These are sophisticated and costly activities that our users
> >> should not be expected to undertake. Since distributed databases are
> >> beginning to offer distributed transactions with fewer caveats, it is past
> >> time for Cassandra to do so as well.
> >>
> >> This CEP proposes the use of several novel techniques that build upon
> >> research (that followed EPaxos) to deliver (non-interactive) general
> >> purpose distributed transactions. The approach is outlined in the wikipage
> >> and in more detail in the linked whitepaper. Importantly, by adopting this
> >> approach we will be the _only_ distributed database to offer global,
> >> scalable, strict serializable transactions in one wide area round-trip.
> >> This would represent a significant improvement in the state of the art,
> >> both in the academic literature and in commercial or open source offerings.
> >>
> >> This work has been partially realised in a prototype. This partial
> >> prototype has been verified against Jepsen.io’s Maelstrom library and
> >> dedicated in-tree strict serializability verification tools, but much work
> >> remains for the work to be production capable and integrated into Cassandra.
> >>
> >> I propose including the prototype in the project as a new source
> >> repository, to be developed as a standalone library for integration into
> >> Cassandra. I hope the community sees the important value proposition of
> >> this proposal, and will adopt the CEP after this discussion, so that the
> >> library and its integration into Cassandra can be developed in parallel and
> >> with the involvement of the wider community.
> >>
> >
> >
> > --
> > http://twitter.com/tjake
>
>

-- 
http://twitter.com/tjake


Re: [DISCUSS] CEP-7 Storage Attached Index

2021-09-07 Thread Caleb Rackliffe
So this thread stalled almost a year ago. (Wow, time flies when you're
trying to release 4.0.) My synthesis of the conversation to this point is
that while there are some open questions about testing
methodology/"definition of done" and our choice of particular on-disk data
structures, neither of these should be a serious obstacle to moving forward
w/ a vote. Having said that, is there anything left around the CEP that we
feel should prevent it from moving to a vote?

In terms of how we would proceed from the point a vote passes, it seems
like there have been enough concerns around the proposed/necessary breaking
changes to the 2i API, that we will start development by introducing
components as incrementally as possible into a long-running feature branch
off trunk. (This work would likely start w/ *CASSANDRA-16092*, which we could
resolve as a sub-task of the SAI epic without interfering with other trunk
development likely destined for a 4.x minor, etc.)

On Thu, Sep 24, 2020 at 2:47 AM Jasonstack Zhao Yang <
jasonstack.z...@gmail.com> wrote:

> >> Question is: is this planned as a next step?
> >> If yes, how are we going to mark SAI as experimental until it gets
> >> row offsets? Also, it is likely that index format is going to change when
> >> row offsets are added, so my concern is that we may have to support two
> >> versions of a format for a smooth migration.
>
> The goal is to support row-level index when merging SAI, I will update the
> CEP about it.
>
> >> I think switching to row
> >> offsets also has a huge impact on interaction with SPRC and has some
> >> potential for optimisations.
>
> Can you share more details on the optimizations?
>
>
>
> On Thu, 24 Sep 2020 at 15:20, Oleksandr Petrov
> wrote:
>
> > > But for improving overall index read performance, I think improving
> > > base table read perf (because SAI/SASI executes LOTS of
> > > SinglePartitionReadCommand after searching on-disk index) is more
> > > effective than switching from Trie to Prefix BTree.
> >
> > I haven't suggested switching to Prefix B-Tree or any other structure,
> > the question was about rationale and motivation of picking one over the
> > other, which I am curious about for personal reasons/interests that lie
> > outside of Cassandra. Having this listed in CEP could have been helpful
> > for future guidance. It's ok if this question is outside of the CEP scope.
> >
> > I also agree that there are many areas that require improvement around
> > the read/write path and 2i, many of which (even outside of base table
> > format or read perf) can yield positive performance results.
> >
> > > FWIW, I personally look forward to receiving that contribution when the
> > > time is right.
> >
> > I am very excited for this contribution, too, and it looks like very
> > solid work.
> >
> > I have one more question, about "Upon resolving partition keys, rows are
> > loaded using Cassandra’s internal partition read command across SSTables
> > and are post filtered". One of the criticisms of SASI and reasons for
> > marking it as experimental was CASSANDRA-11990. I think switching to row
> > offsets also has a huge impact on interaction with SPRC and has some
> > potential for optimisations. Question is: is this planned as a next step?
> > If yes, how are we going to mark SAI as experimental until it gets
> > row offsets? Also, it is likely that index format is going to change when
> > row offsets are added, so my concern is that we may have to support two
> > versions of a format for a smooth migration.
> >
> >
> >
> > On Thu, Sep 24, 2020 at 6:53 AM Jasonstack Zhao Yang <
> > jasonstack.z...@gmail.com> wrote:
> >
> > > >> I think CEP should be more upfront with "eventually replace
> > > >> it" bit, since it raises the question about what the people who are
> > > >> using other index implementations can expect.
> > >
> > > Will update the CEP to emphasize: SAI will replace other indexes.
> > >
> > > >> Unfortunately, I do not have an
> > > >> implementation sitting around for a direct comparison, but I can
> > > >> imagine situations when B-Trees may perform better because of simpler
> > > >> construction. Maybe we should even consider prototyping a prefix
> > > >> B-Tree to have a more fair comparison.
> > >
> > > As long as prefix BTree supports range/prefix aggregation (which is used
> > > to speed up range/prefix query when matching entire subtree), we can plug
> > > it in and compare. It won't affect the CEP design which focuses on sharing
> > > data across indexes and posting aggregation.
> > >
> > > But for improving overall index read performance, I think improving base
> > > table read perf (because SAI/SASI executes LOTS of
> > > SinglePartitionReadCommand after searching on-disk index) is more
> > > effective than switching from Trie to Prefix BTree.
> > >
> > >
> > >
> > > On Thu, 24 Sep 2020 at 05:33, Be

Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-07 Thread Blake Eggleston
Hi Jake,

> 1.  Will this effort eventually replace consistency levels in C*?  I ask
> because one of the shortcomings of our paxos today is
> it can be easily mixed with non serialized consistencies and therefore
> users commonly break consistency by for example reading at CL.ONE while
> also
> using LWTs.

This will likely require CLs to be specified at the schema level for tables 
using multi partition transactions. I’d expect this to be available for other 
tables, but not required.

> 2. What structural changes are planned to support an external dependency
> project like this?  Are there some high level interfaces you expect the
> project to adhere to?

There will be some interfaces that need to be implemented in C* to support the 
library. You can find the current interfaces in the accord.api package, but 
these were written to support some initial testing, and not intended for 
integration into C* as is. Things are pretty fluid right now and will be 
rewritten / refactored multiple times over the next few months.

Thanks,

Blake


> On Sun, Sep 5, 2021 at 10:33 AM bened...@apache.org 
> wrote:
> 
>> Wiki:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
>> Whitepaper:
>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
>> <
>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
>>> 
>> Prototype: https://github.com/belliottsmith/accord
>> 
>> Hi everyone, I’d like to propose this CEP for adoption by the community.
>> 
>> Cassandra has benefitted from LWTs for many years, but application
>> developers that want to ensure consistency for complex operations must
>> either accept the scalability bottleneck of serializing all related state
>> through a single partition, or layer a complex state machine on top of the
>> database. These are sophisticated and costly activities that our users
>> should not be expected to undertake. Since distributed databases are
>> beginning to offer distributed transactions with fewer caveats, it is past
>> time for Cassandra to do so as well.
>> 
>> This CEP proposes the use of several novel techniques that build upon
>> research (that followed EPaxos) to deliver (non-interactive) general
>> purpose distributed transactions. The approach is outlined in the wikipage
>> and in more detail in the linked whitepaper. Importantly, by adopting this
>> approach we will be the _only_ distributed database to offer global,
>> scalable, strict serializable transactions in one wide area round-trip.
>> This would represent a significant improvement in the state of the art,
>> both in the academic literature and in commercial or open source offerings.
>> 
>> This work has been partially realised in a prototype. This partial
>> prototype has been verified against Jepsen.io’s Maelstrom library and
>> dedicated in-tree strict serializability verification tools, but much work
>> remains for the work to be production capable and integrated into Cassandra.
>> 
>> I propose including the prototype in the project as a new source
>> repository, to be developed as a standalone library for integration into
>> Cassandra. I hope the community sees the important value proposition of
>> this proposal, and will adopt the CEP after this discussion, so that the
>> library and its integration into Cassandra can be developed in parallel and
>> with the involvement of the wider community.
>> 
> 
> 
> -- 
> http://twitter.com/tjake





Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-07 Thread Jake Luciani
Hi Benedict!

I haven't gone too deeply into this proposal but it's very exciting to see
this kind of innovation!

Some basic questions, tangentially related to this effort, that I didn't see
covered in the CEP.

1.  Will this effort eventually replace consistency levels in C*?  I ask
because one of the shortcomings of our paxos today is that it can be easily
mixed with non-serialized consistency levels, and therefore users commonly
break consistency by, for example, reading at CL.ONE while also using LWTs.

2. What structural changes are planned to support an external dependency
project like this?  Are there some high level interfaces you expect the
project to adhere to?

Thanks
Jake




On Sun, Sep 5, 2021 at 10:33 AM bened...@apache.org 
wrote:

> Wiki:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> Whitepaper:
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> <
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> >
> Prototype: https://github.com/belliottsmith/accord
>
> Hi everyone, I’d like to propose this CEP for adoption by the community.
>
> Cassandra has benefitted from LWTs for many years, but application
> developers that want to ensure consistency for complex operations must
> either accept the scalability bottleneck of serializing all related state
> through a single partition, or layer a complex state machine on top of the
> database. These are sophisticated and costly activities that our users
> should not be expected to undertake. Since distributed databases are
> beginning to offer distributed transactions with fewer caveats, it is past
> time for Cassandra to do so as well.
>
> This CEP proposes the use of several novel techniques that build upon
> research (that followed EPaxos) to deliver (non-interactive) general
> purpose distributed transactions. The approach is outlined in the wikipage
> and in more detail in the linked whitepaper. Importantly, by adopting this
> approach we will be the _only_ distributed database to offer global,
> scalable, strict serializable transactions in one wide area round-trip.
> This would represent a significant improvement in the state of the art,
> both in the academic literature and in commercial or open source offerings.
>
> This work has been partially realised in a prototype. This partial
> prototype has been verified against Jepsen.io’s Maelstrom library and
> dedicated in-tree strict serializability verification tools, but much work
> remains for the work to be production capable and integrated into Cassandra.
>
> I propose including the prototype in the project as a new source
> repository, to be developed as a standalone library for integration into
> Cassandra. I hope the community sees the important value proposition of
> this proposal, and will adopt the CEP after this discussion, so that the
> library and its integration into Cassandra can be developed in parallel and
> with the involvement of the wider community.
>


-- 
http://twitter.com/tjake


Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-07 Thread Henrik Ingo
On Tue, Sep 7, 2021 at 5:06 PM bened...@apache.org 
wrote:

> > I was thinking that a path similar to Calvin/FaunaDB is certainly
> > looming on the horizon at least.
>
> I’m not sure which aspect of these systems you are referring to. Unless I
> have misunderstood, I consider them to be strictly inferior approaches
> (particularly for Cassandra) as they require a _global_ leader process and
> as a result have scalability limits. Users simply shift the sharding
> problem to the cluster level rather than the node level, but the
> fundamental problem remains. This may be acceptable for many users, but was
> contrary to the goals of this CEP.
>

Oh yes. For sure it's one of the strengths of the CEP that it is clearly
designed to fit well into the existing Cassandra architecture and
experience.

I was referring to the property that Calvin transactions also need to be
sent to the cluster in a single shot, but then they have extended the
functionality by allowing programming logic to be executed inside the
transaction. (Like a stored procedure, if you will.) So the transactions
can be multi-statement with complex logic, they just can't communicate
outside the cluster - such as back and forth with the client and server.
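
To make the "single shot with logic inside" idea concrete, here is a toy sketch in Python. It is not Accord, Calvin or any Cassandra API: `SingleShotStore`, `execute` and `transfer` are made-up names, and a local lock stands in for whatever ordering protocol a real cluster would use. The point is only that the transaction declares its keys up front, its logic runs server-side against a snapshot, and it cannot go back and forth with the client while it runs.

```python
import threading


class SingleShotStore:
    """Toy key-value store; a lock stands in for the cluster's ordering protocol."""

    def __init__(self, initial=None):
        self._data = dict(initial or {})
        self._lock = threading.Lock()

    def execute(self, keys, logic):
        """Run `logic` over the declared keys and apply its writes atomically.

        `logic` receives a snapshot {key: value} and returns {key: new_value}.
        It has no channel back to the client while it runs, which is the
        non-interactive ("single shot") property.
        """
        with self._lock:
            snapshot = {k: self._data.get(k) for k in keys}
            writes = logic(snapshot)
            undeclared = set(writes) - set(keys)
            if undeclared:
                raise ValueError(f"writes outside declared keys: {sorted(undeclared)}")
            self._data.update(writes)
            return snapshot

    def get(self, key):
        with self._lock:
            return self._data.get(key)


def transfer(amount, src, dst):
    """Build transaction logic: move `amount` from src to dst if funds allow."""
    def logic(snapshot):
        if snapshot[src] < amount:
            return {}  # condition failed: no writes
        return {src: snapshot[src] - amount, dst: snapshot[dst] + amount}
    return logic


store = SingleShotStore({"alice": 100, "bob": 20})
store.execute(["alice", "bob"], transfer(30, "alice", "bob"))
```

The branch inside `transfer` is the "stored procedure" part: multi-statement logic, but decided entirely from the snapshot handed to it.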


> > good job pulling together ingredients from state of the art work in this
> > area
>
> In case this was lost in the noise: this work is not simply an assembly of
> prior work. It introduces entirely novel approaches that permit the work to
> exceed the capabilities of any prior research or production system. It is
> worth properly highlighting that if we deliver this, Cassandra will have
> the most sophisticated transaction system full stop.
>
>
Of course. Maybe it's just me, but I'm at least equally impressed by the
"level of education" the authors show in not reinventing the wheel for the
details where copying a feature, or at least being inspired by one, from
some existing publication or implementation was possible. Knowing what to
keep vs what you want to improve isn't easy. Also, it makes the whitepaper
an interesting read when in addition to learning about Accord I also
learned about several other systems that I hadn't previously read about.

henrik


Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-07 Thread bened...@apache.org
> I was thinking that a path similar to Calvin/FaunaDB is certainly looming on
> the horizon at least.

I’m not sure which aspect of these systems you are referring to. Unless I have 
misunderstood, I consider them to be strictly inferior approaches (particularly 
for Cassandra) as they require a _global_ leader process and as a result have 
scalability limits. Users simply shift the sharding problem to the cluster 
level rather than the node level, but the fundamental problem remains. This may 
be acceptable for many users, but was contrary to the goals of this CEP.

> It seems to me at that point long running queries and interactive 
> transactions are mostly the same problem.

I would estimate long running queries to be easier to deliver by at least an 
order of magnitude. They’re not unrelated, but they’re still quite distinct in 
my opinion.

> good job pulling together ingredients from state of the art work in this area

In case this was lost in the noise: this work is not simply an assembly of 
prior work. It introduces entirely novel approaches that permit the work to 
exceed the capabilities of any prior research or production system. It is worth 
properly highlighting that if we deliver this, Cassandra will have the most 
sophisticated transaction system full stop.

There are to my knowledge no databases offering distributed transactions that 
are both strict serializable and have no scalability bottleneck. Every database 
today clearly aims for this combination, but accepts some trade-off: either 
only guaranteeing serializable isolation, requiring special time keeping 
hardware to guarantee strict serializability, or using a global leader process 
(or using two-phase commit, but this is quite niche).



From: Henrik Ingo 
Date: Tuesday, 7 September 2021 at 14:06
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
On Tue, Sep 7, 2021 at 12:26 PM bened...@apache.org 
wrote:

> > whether I should just think of this as "better and more efficient LWT”
>
> So, the LWT concept is a Cassandra one and doesn’t have an agreed-upon
> definition. My understanding of a core feature/limitation of LWTs is that
> they operate over a single partition, and as a result many operations are
> impossible even in multiple rounds without complex distributed state
> machines. The core improvement here, besides improved performance, is that
> we will be able to operate over any set of keys at-once.
>
>
My bad, I have never used LWT and forgot / didn't know they were single
partition. The CEP makes more sense now.



> How this facility is evolved into user-facing capabilities is an
> open-ended question. Initially of course we will at least support the same
> syntax but remove the restriction on operating over a single partition. I
> haven’t thought about this much, as the CEP is primarily for enabling
> works, but I think we will want to expand the syntax in two ways:
>
>  1) to support more complex conditions (simple AND conditions across all
> partitions seem likely too restrictive, though they might make sense for
> the single partition case);
>   2) to support inserting data from one row into another, potentially with
> transformations being applied (including via UDFs).
>
> These are both relatively manageable improvements that we might want to
> land in the same major release as the transactions themselves. The core
> facility can be expanded quite broadly, though. It would be possible for
> instance to support some interpreted language(s) as part of a query, so
> that arbitrary work can be applied in the transaction.
>

I was thinking that a path similar to Calvin/FaunaDB is certainly looming
in the horizon at least. I've been following those with interest, because
a) it's refreshingly outside of the box thinking, and b) they seem to be
able to push the limitations of this approach much beyond what one might
imagine when reading about it the first time. But like you also point out,
it remains to be seen whether users actually want those kinds of
transactions. We are creatures of habit for sure.



> Or, perhaps the community would rather build atop the feature to support
> interactive transactions at the client. I can’t predict resourcing for
> this, though, and it might be a community effort. I think it would be quite
> tractable once this work lands, however.
>
> > Suppose I wanted to do a long running read-only transaction
>
> So, there’s two sides to this: with and without paging. A long running
> read-only transaction taking a few seconds is quite likely to be fine and
> we will probably support with some MVCC within the transaction system
> itself. This may or may not be part of v1, it’s hard to predict with
> certainty as this is going to be a large undertaking.
>
> But for paged queries we’d be talking about SNAPSHOT isolation. This is
> likely to be something the community wants to support before long anyway
> and is probably not as hard as you might think. It is pr

Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-07 Thread Henrik Ingo
On Tue, Sep 7, 2021 at 12:26 PM bened...@apache.org 
wrote:

> > whether I should just* think of this as "better and more efficient LWT"
>
> So, the LWT concept is a Cassandra one and doesn’t have an agreed-upon
> definition. My understanding of a core feature/limitation of LWTs is that
> they operate over a single partition, and as a result many operations are
> impossible even in multiple rounds without complex distributed state
> machines. The core improvement here, besides improved performance, is that
> we will be able to operate over any set of keys at once.
>
>
My bad, I have never used LWT and forgot / didn't know they were single
partition. The CEP makes more sense now.



> How this facility is evolved into user-facing capabilities is an
> open-ended question. Initially of course we will at least support the same
> syntax but remove the restriction on operating over a single partition. I
> haven’t thought about this much, as the CEP is primarily for enabling
> works, but I think we will want to expand the syntax in two ways:
>
>  1) to support more complex conditions (simple AND conditions across all
> partitions seem likely too restrictive, though they might make sense for
> the single partition case);
>  2) to support inserting data from one row into another, potentially with
> transformations being applied (including via UDFs).
>
> These are both relatively manageable improvements that we might want to
> land in the same major release as the transactions themselves. The core
> facility can be expanded quite broadly, though. It would be possible for
> instance to support some interpreted language(s) as part of a query, so
> that arbitrary work can be applied in the transaction.
>

I was thinking that a path similar to Calvin/FaunaDB is certainly looming
on the horizon at least. I've been following those with interest, because
a) it's refreshingly outside of the box thinking, and b) they seem to be
able to push the limitations of this approach much beyond what one might
imagine when reading about it the first time. But like you also point out,
it remains to be seen whether users actually want those kinds of
transactions. We are creatures of habit for sure.



> Or, perhaps the community would rather build atop the feature to support
> interactive transactions at the client. I can’t predict resourcing for
> this, though, and it might be a community effort. I think it would be quite
> tractable once this work lands, however.
>
> > Suppose I wanted to do a long running read-only transaction
>
> So, there’s two sides to this: with and without paging. A long running
> read-only transaction taking a few seconds is quite likely to be fine and
> we will probably support it with some MVCC within the transaction system
> itself. This may or may not be part of v1, it’s hard to predict with
> certainty as this is going to be a large undertaking.
>
> But for paged queries we’d be talking about SNAPSHOT isolation. This is
> likely to be something the community wants to support before long anyway
> and is probably not as hard as you might think. It is probably outside of
> the scope of this work, though the two would dovetail very nicely.
>

I've pointed out to some of my colleagues that since Cassandra's storage
engine is an LSM engine, with some additional work it could become an MVCC
style storage engine. Your thinking here seems to be in the same direction,
even if it's beyond version 1. (Just for context, and for the benefit of other
readers on the list, it took MongoDB 5 years and 6 major releases to
develop distributed multi-shard transactions. So it's good to talk about
the general direction, but understanding that this is not something anyone
will finish before Christmas.)

It seems to me at that point long running queries and interactive
transactions are mostly the same problem.



Benedict, thanks for the answers. Since I'm not a Cassandra developer I
feel it would be inappropriate for me to express an opinion for or against,
so I'll just end with saying this is an interesting proposal and the
authors have done a good job pulling together ingredients from state of the
art work in this area. As such it will be interesting to follow the
discussion and work from whitepaper to implementation.


A secondary objective was also to just let everyone know I am lurking here.
If you ever want to reach out for an off-band discussion, you now have my
contact details.

henrik


Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-07 Thread bened...@apache.org
> Sorry if a few comments were a bit "editorial" in the first message

Not a problem at all – more than happy to talk about suggestions in that vein! 
Just probably best not to subject everyone else to the discussion.

> What I would like to understand better and without guessing is, what do these 
> transactions look like from a client/user point of view?

This is a fair question, and perhaps something I should pinpoint more directly 
for the reader. The CEP does stipulate non-interactive transactions, i.e. those 
that are one-shot. The only other limitation is that the partition keys must be 
known upfront; however, I expect we will follow up soon after with some weaker 
semantics that build on top (probably using optimistic concurrency control) to 
support transactions where only some partition keys are known upfront, so that 
we may support global secondary indexes with proper isolation and consistency.

> whether I should just* think of this as "better and more efficient LWT"

So, the LWT concept is a Cassandra one and doesn’t have an agreed-upon 
definition. My understanding of a core feature/limitation of LWTs is that they 
operate over a single partition, and as a result many operations are impossible 
even in multiple rounds without complex distributed state machines. The core 
improvement here, besides improved performance, is that we will be able to 
operate over any set of keys at once.

How this facility is evolved into user-facing capabilities is an open-ended 
question. Initially of course we will at least support the same syntax but 
remove the restriction on operating over a single partition. I haven’t thought 
about this much, as the CEP is primarily for enabling works, but I think we 
will want to expand the syntax in two ways:

 1) to support more complex conditions (simple AND conditions across all 
partitions seem likely too restrictive, though they might make sense for the 
single partition case);
 2) to support inserting data from one row into another, potentially with 
transformations being applied (including via UDFs).

These are both relatively manageable improvements that we might want to land in 
the same major release as the transactions themselves. The core facility can be 
expanded quite broadly, though. It would be possible for instance to support 
some interpreted language(s) as part of a query, so that arbitrary work can be 
applied in the transaction.

Or, perhaps the community would rather build atop the feature to support 
interactive transactions at the client. I can’t predict resourcing for this, 
though, and it might be a community effort. I think it would be quite tractable 
once this work lands, however.

> Suppose I wanted to do a long running read-only transaction

So, there’s two sides to this: with and without paging. A long running 
read-only transaction taking a few seconds is quite likely to be fine and we 
will probably support it with some MVCC within the transaction system itself. This 
may or may not be part of v1, it’s hard to predict with certainty as this is 
going to be a large undertaking.

But for paged queries we’d be talking about SNAPSHOT isolation. This is likely 
to be something the community wants to support before long anyway and is 
probably not as hard as you might think. It is probably outside of the scope of 
this work, though the two would dovetail very nicely.


From: Henrik Ingo 
Date: Tuesday, 7 September 2021 at 09:24
To: dev@cassandra.apache.org 
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions

Re: [DISCUSS] CEP-15: General Purpose Transactions

2021-09-07 Thread Henrik Ingo
On Tue, Sep 7, 2021 at 1:31 AM bened...@apache.org 
wrote:

>
> Of course, but we may have to be selective in our back-and-forth. We can
> always take some discussion off-list to keep it manageable.
>
>
I'll try to converge. Sorry if a few comments were a bit "editorial" in the
first message. I find that sometimes it pays off to also ask the dumb
questions, as long as we don't get stuck on any of them.


> > The algorithm is hard to read since you omit the roles of the
> > participants.
>
> Thanks. I will consider how I might make it clearer that the portions of
> the algorithm that execute on receipt of messages that may only be received
> by replicas, are indeed executed by those replicas.
>
>
In fact the same algorithm in the CEP was easier to read exactly because of
this, I now realize.


> > So I guess my question is how and when reads happen?
>
> I think this is reasonably well specified in the protocol and, since it’s
> unclear what you’ve found confusing, I don’t know it would be productive to
> try to explain it again here on list. You can look at the prototype, if
> Java is easier for you to parse, as it is of course fully specified there
> with no ambiguity. Or we can discuss off list, or perhaps on the community
> slack channel.
>
>
Maybe my question was a bit too open ended, as I didn't want to lead into
any specific direction.

I can of course tell where reads happen in the execution algorithm. What I
would like to understand better and without guessing is, what do these
transactions look like from a client/user point of view? You already
confirmed that interactive transactions aren't intended by this proposal.
At the other end of the spectrum, given that this is a Cassandra
Enhancement Proposal, and the CEP does in fact state this, it seems like
providing equivalent functionality to already existing LWT is a goal. So my
question is whether I should just* think of this as "better and more
efficient LWT" or is there something more? Would this CEP or follow-up work
introduce any new CQL syntax, for example?

To give just one more example of the kind of questions I'm triangulating
at: Suppose I wanted to do a long running read-only transaction, such as
querying a secondary index. Like SERIAL in current Cassandra, but taking
seconds to execute and returning thousands of rows. How would you see the
possibilities and limits of such operations in Accord?

*) Should emphasize that better scaling LWTs isn't just "just". If I
imagine a future Cassandra cluster where all reads and writes are
transactional and therefore strict serializable, that would be quite a
change from today.

henrik