Re: CEP-15 Update

Jordan West Thu, 06 Mar 2025 12:17:25 -0800

The work and effort in Accord has been amazing. And I’m sure it sets a new
standard for code quality and correctness testing, which I’m also entirely
behind. I also trust the folks working on it want to take it to a fully
production-ready solution. But I’m worried about circumstances out of our
control leaving us with a very complex feature that isn’t complete.


I do have some questions. Could folks help me better understand why testing
real workloads necessitates a merge (my understanding of the original
reasoning is that this is the impetus for merging now)? Also, I think the
performance and schema change caveats are rather large ones. One of Accord’s
promises was better performance, and I think not supporting schema changes
with nodes down is a big gap. Could we have some criteria like
“supports all the operations PaxosV2 supports” or “performs as well or
better than PaxosV2 on [workload(s)]”?

I understand waiting asks a lot of the authors in terms of bearing the
burden of a more complex merge. But I think we also need to consider what
merging is asking the community to bear if the worst happens and we are
unable to take the feature from its current state to something that can be
widely used in production.


Jordan


On Wed, Mar 5, 2025 at 15:52 Blake Eggleston <[email protected]> wrote:

> +1 to merging it
>
> On Wed, Mar 5, 2025, at 12:22 PM, Patrick McFadin wrote:
>
> You have my +1
>
> On Wed, Mar 5, 2025 at 12:16 PM Benedict <[email protected]> wrote:
> >
> > Correct, these caveats should only apply to tables that have opted-in to
> accord.
> >
> > On 5 Mar 2025, at 20:08, Jeremiah Jordan <[email protected]> wrote:
> >
> > 
> > So great to see all this hard work about to pay off!
> >
> > On the questions/concerns front, the only concern I would have towards
> merging this to trunk is if any of the caveats apply when someone is not
> using Accord.  Assuming they only apply when the feature flag is enabled, I
> see no reason not to get this merged into trunk once everyone involved is
> happy with the state of it.
> >
> > -Jeremiah
> >
> > On Mar 5, 2025 at 12:15:23 PM, Benedict Elliott Smith <
> [email protected]> wrote:
> >>
> >> That depends on all of you lovely people :D
> >>
> >> I think we should have finished merging everything we want before QA by
> ~Monday; certainly not much later.
> >>
> >> I think we have some upgrade and python dtest failures to address as
> well.
> >>
> >> So it could be pretty soon if the community is supportive.
> >>
> >> On 5 Mar 2025, at 17:22, Patrick McFadin <[email protected]> wrote:
> >>
> >>
> >> What is the timing for starting the merge process? I'm asking because
> >> I have (yet another) presentation and this would be a cool update.
> >>
> >>
> >> On Wed, Mar 5, 2025 at 1:22 AM Benedict Elliott Smith
> >> <[email protected]> wrote:
> >>
> >> >
> >>
> >> > Thanks everyone.
> >>
> >> >
> >>
> >> > Jon - your help will be greatly appreciated. We’ll let you know when
> we’ve got the cycles to invest in performance work (hopefully fairly soon).
> I expect the first step will be improving visibility so we can better
> understand what the system is doing (particularly the caching layers), but
> we can dig in together when ready.
> >>
> >> >
> >>
> >> > On 4 Mar 2025, at 18:15, Jon Haddad <[email protected]> wrote:
> >>
> >> >
> >>
> >> > Very exciting!
> >>
> >> >
> >>
> >> > I have a client that's very interested in Accord, so I should have
> budget to dig into it, especially on the performance side of things.
> >>
> >> >
> >>
> >> > Jon
> >>
> >> >
> >>
> >> > On Tue, Mar 4, 2025 at 9:57 AM Dmitry Konstantinov <
> [email protected]> wrote:
> >>
> >> >>
> >>
> >> >> Thank you to all Accord and TCM contributors, it is really exciting
> to see a development of such huge and wonderful features moving forward and
> opening the door to the new Cassandra epoch!
> >>
> >> >>
> >>
> >> >> On Tue, 4 Mar 2025 at 20:45, Blake Eggleston <[email protected]>
> wrote:
> >>
> >> >>>
> >>
> >> >>> Thanks Benedict!
> >>
> >> >>>
> >>
> >> >>> I’m really excited to see accord reach this milestone, even with
> these caveats. You seem to have left yourself off the list of contributors
> though, even though you’ve been a central figure in its development :) So
> thanks to all accord & tcm contributors, including Benedict, for making
> this possible!
> >>
> >> >>>
> >>
> >> >>> On Tue, Mar 4, 2025, at 8:00 AM, Benedict Elliott Smith wrote:
> >>
> >> >>>
> >>
> >> >>> Hi everyone,
> >>
> >> >>>
> >>
> >> >>> It’s been exactly 3.5 years since the first commit to
> cassandra-accord. Yes, really, it’s been that long.
> >>
> >> >>>
> >>
> >> >>> We will be starting to validate the feature against real workloads
> in the near future, so we can’t sensibly push off merging much longer. The
> following is a brief run-down of the state of play. There are no known
> bugs, but there remain a number of caveats we will be incrementally
> addressing in the run-up to a full release:
> >>
> >> >>>
> >>
> >> >>> [1] Accord is likely to be SLOW until further optimisations are
> implemented
> >>
> >> >>> [2] Schema changes have a number of hard edges
> >>
> >> >>> [3] Validation is ongoing, so there are likely still a number of
> bugs to shake out
> >>
> >> >>> [4] Many operator visibility/tooling/documentation improvements are
> pending
> >>
> >> >>>
> >>
> >> >>> To expand a little:
> >>
> >> >>>
> >>
> >> >>> [1] As of the last experiment we conducted, accord’s throughput was
> poor - also leading to higher LAN latencies. We have done no WAN
> experiments to date, but the protocol guarantees should already achieve
> better round-trip performance, in particular under contention. Improving
> throughput will be the main focus of attention once we are satisfied the
> protocol is otherwise stable, but our focus remains validation for the
> moment.
> >>
> >> >>> [2] Schema changes have not yet been well integrated with TCM.
> Dropping a table for instance will currently cause problems if nodes are
> offline.
> >>
> >> >>> [3] We have a range of validations we are already performing
> against cassandra-accord directly, and against its integration with
> Cassandra in cep-15-accord. We have run hundreds of billions of simulated
> transactions, and are still discovering some minor fault every few billion
> simulated transactions or so. There remains a lot more simulated validation
> to explore, as well as with real clusters serving real workloads.
> >>
> >> >>> [4] There are already a range of virtual tables for exploring
> internal state in Accord, and reasonably good metric support. However,
> tracing is not yet supported, and our metric and virtual table integrations
> need some further development.
> >>
> >> >>> [5] There are also other edge cases to address such as ensuring we
> do not reuse HLCs after restart, supporting ByteOrderPartitioner, and live
> migration from/to Paxos is undergoing fine-tuning and validation; probably
> there are some other things I am forgetting.
> >>
> >> >>>
> >>
> >> >>> Altogether the feature is fairly mature, despite these caveats.
> This is the fruit of the labour of a long list of contributors, including
> Aleksey Yeschenko, Alex Petrov, Ariel Weisberg, Blake Eggleston, Caleb
> Rackliffe and David Capwell, and represents a huge undertaking. It also
> wouldn’t have been possible without the work of Alex Petrov, Marcus
> Eriksson and Sam Tunnicliffe on delivering transactional cluster metadata.
> I hope you will join me in thanking them all for their contributions.
> >>
> >> >>>
> >>
> >> >>> Alex has also kindly produced some initial overview documentation
> for developers, that can be found here:
> https://github.com/apache/cassandra/blob/cep-15-accord/doc/modules/cassandra/pages/developing/accord/index.adoc.
> This will be expanded as time permits.
> >>
> >> >>>
> >>
> >> >>> Does anyone have any questions or concerns?
> >>
> >> >>>
> >>
> >> >>>
> >>
> >> >>
> >>
> >> >>
> >>
> >> >> --
> >>
> >> >> Dmitry Konstantinov
> >>
> >> >
> >>
> >> >
> >>
> >>
>
>
>
