Re: CEP-15 Update

Jeremiah Jordan Thu, 06 Mar 2025 13:51:24 -0800

I have no stake in this feature besides thinking it would enable some great
new end user workloads. Also it actually makes my professional life easier
the longer this code is not merged.


But given the new feature (and its caveats) are behind a yaml flag, as well
as opt in at the table level, I don’t see a reason to keep it from merging
as is.  If this was not the case I would also be against early merging
before the caveats were resolved.  In my opinion since the feature is self
contained and off by default, I see no issues.

On the topic of “what if they throw it over the fence and run away”. Given
the time invested, and the number of committers and PMC members involved in
the development of the feature, does anyone realistically believe they are
going to drop it in 90% finished and then leave it unfinished?

Do we want to eventually have this feature in the database?  My feeling is
that if we do then we should get it merged to trunk so that it can be
completed faster.

I think we have had this discussion in the past about merging semi-finished
CEPs in stages behind feature flags so that more people could get
visibility into them.  The latest thread I can find doesn’t seem to have
anyone completely against it, with some people wanting to make sure
anything merged did not put trunk into an “un-releasable” state and others
expressing they would want such code to be useful as an MVP.
https://lists.apache.org/thread/c39yrbdszgz9s34vr17wpjdhv6h2oo61

Is someone -1 on this merging early behind the flag?  If not I think it can
be merged whenever? AFAIK all the code in the feature branch has the
required two +1’s already.

-Jeremiah

On Thu, Mar 6, 2025 at 2:44 PM Benedict Elliott Smith <[email protected]>
wrote:

> Because we want to validate against the latest code in trunk, else we are
> validating stale behaviours. The cost of rebasing is high, so we do not do
> it frequently. That means we will likely stop developing OSS-first, as the
> focus will have to move to our internal branch that satisfies these
> criteria.
>
> Exactly what this might be for upstreaming I cannot say. Personally, I aim
> to work exclusively on the branch we are stabilising. If that is not trunk,
> the latency for my contributions being made public might be high, as I have
> a huge imbalance of over-investment to recoup, and anything unnecessary
> will be deferred.
>
> Since the feature is disabled, and the code is almost entirely isolated, I
> cannot imagine the cost to the community of removing this work would be
> very high. But, I do not intend to argue Accord’s case here. I will let you
> all decide.
>
> Please decide soon though, as it shapes our work planning. The positive
> reception so far had led me to consider prioritising a move to trunk-first
> development within the next week or two, and the associated work that
> entails. However, if that was optimistic we will have to shift our plans.
>
>
>
> On 6 Mar 2025, at 20:16, Jordan West <[email protected]> wrote:
>
> The work and effort in accord has been amazing. And I’m sure it sets a new
> standard for code quality and correctness testing which I’m also entirely
> behind. I also trust the folks working on it want to take it to a fully
> production-ready solution. But I’m worried about circumstances out of our
> control leaving us with a very complex feature that isn’t complete.
>
> I do have some questions. Could folks help me better understand why
> testing real workloads necessitates a merge (my understanding from the
> original reason is this is the impetus for why we would merge now)? Also I
> think the performance and schema change caveats are rather large ones. One
> of Accord’s promises was better performance, and I think not supporting
> schema changes with nodes down is a big gap. Could we have some
> criteria like “supports all the operations PaxosV2 supports” or “performs
> as well or better than PaxosV2 on [workload(s)]”?
>
> I understand waiting asks a lot of the authors in terms of bearing the
> burden of a more complex merge. But I think we also need to consider what
> merging is asking the community to bear if the worst happens and we are
> unable to take the feature from its current state to something that can be
> widely used in production.
>
>
> Jordan
>
>
> On Wed, Mar 5, 2025 at 15:52 Blake Eggleston <[email protected]> wrote:
>
>> +1 to merging it
>>
>> On Wed, Mar 5, 2025, at 12:22 PM, Patrick McFadin wrote:
>>
>> You have my +1
>>
>> On Wed, Mar 5, 2025 at 12:16 PM Benedict <[email protected]> wrote:
>> >
>> > Correct, these caveats should only apply to tables that have opted-in
>> to accord.
>> >
>> > On 5 Mar 2025, at 20:08, Jeremiah Jordan <[email protected]> wrote:
>> >
>> >
>> > So great to see all this hard work about to pay off!
>> >
>> > On the questions/concerns front, the only concern I would have towards
>> merging this to trunk is if any of the caveats apply when someone is not
>> using Accord.  Assuming they only apply when the feature flag is enabled, I
>> see no reason not to get this merged into trunk once everyone involved is
>> happy with the state of it.
>> >
>> > -Jeremiah
>> >
>> > On Mar 5, 2025 at 12:15:23 PM, Benedict Elliott Smith <
>> [email protected]> wrote:
>> >>
>> >> That depends on all of you lovely people :D
>> >>
>> >> I think we should have finished merging everything we want before QA
>> by ~Monday; certainly not much later.
>> >>
>> >> I think we have some upgrade and python dtest failures to address as
>> well.
>> >>
>> >> So it could be pretty soon if the community is supportive.
>> >>
>> >> On 5 Mar 2025, at 17:22, Patrick McFadin <[email protected]> wrote:
>> >>
>> >>
>> >> What is the timing for starting the merge process? I'm asking because
>> >> I have (yet another) presentation and this would be a cool update.
>> >>
>> >>
>> >> On Wed, Mar 5, 2025 at 1:22 AM Benedict Elliott Smith
>> >>
>> >> <[email protected]> wrote:
>> >>
>> >> >
>> >>
>> >> > Thanks everyone.
>> >>
>> >> >
>> >>
>> >> > Jon - your help will be greatly appreciated. We’ll let you know when
>> we’ve got the cycles to invest in performance work (hopefully fairly soon).
>> I expect the first step will be improving visibility so we can better
>> understand what the system is doing (particularly the caching layers), but
>> we can dig in together when ready.
>> >>
>> >> >
>> >>
>> >> > On 4 Mar 2025, at 18:15, Jon Haddad <[email protected]> wrote:
>> >>
>> >> >
>> >>
>> >> > Very exciting!
>> >>
>> >> >
>> >>
>> >> > I have a client that's very interested in Accord, so I should have
>> budget to dig into it, especially on the performance side of things.
>> >>
>> >> >
>> >>
>> >> > Jon
>> >>
>> >> >
>> >>
>> >> > On Tue, Mar 4, 2025 at 9:57 AM Dmitry Konstantinov <
>> [email protected]> wrote:
>> >>
>> >> >>
>> >>
>> >> >> Thank you to all Accord and TCM contributors, it is really exciting
>> to see a development of such huge and wonderful features moving forward and
>> opening the door to the new Cassandra epoch!
>> >>
>> >> >>
>> >>
>> >> >> On Tue, 4 Mar 2025 at 20:45, Blake Eggleston <[email protected]>
>> wrote:
>> >>
>> >> >>>
>> >>
>> >> >>> Thanks Benedict!
>> >>
>> >> >>>
>> >>
>> >> >>> I’m really excited to see accord reach this milestone, even with
>> these caveats. You seem to have left yourself off the list of contributors
>> though, even though you’ve been a central figure in its development :) So
>> thanks to all accord & tcm contributors, including Benedict, for making
>> this possible!
>> >>
>> >> >>>
>> >>
>> >> >>> On Tue, Mar 4, 2025, at 8:00 AM, Benedict Elliott Smith wrote:
>> >>
>> >> >>>
>> >>
>> >> >>> Hi everyone,
>> >>
>> >> >>>
>> >>
>> >> >>> It’s been exactly 3.5 years since the first commit to
>> cassandra-accord. Yes, really, it’s been that long.
>> >>
>> >> >>>
>> >>
>> >> >>> We will be starting to validate the feature against real workloads
>> in the near future, so we can’t sensibly push off merging much longer. The
>> following is a brief run-down of the state of play. There are no known
>> bugs, but there remain a number of caveats we will be incrementally
>> addressing in the run-up to a full release:
>> >>
>> >> >>>
>> >>
>> >> >>> [1] Accord is likely to be SLOW until further optimisations are
>> implemented
>> >>
>> >> >>> [2] Schema changes have a number of hard edges
>> >>
>> >> >>> [3] Validation is ongoing, so there are likely still a number of
>> bugs to shake out
>> >>
>> >> >>> [4] Many operator visibility/tooling/documentation improvements
>> are pending
>> >>
>> >> >>>
>> >>
>> >> >>> To expand a little:
>> >>
>> >> >>>
>> >>
>> >> >>> [1] As of the last experiment we conducted, accord’s throughput
>> was poor - also leading to higher LAN latencies. We have done no WAN
>> experiments to date, but the protocol guarantees should already achieve
>> better round-trip performance, in particular under contention. Improving
>> throughput will be the main focus of attention once we are satisfied the
>> protocol is otherwise stable, but our focus remains validation for the
>> moment.
>> >>
>> >> >>> [2] Schema changes have not yet been well integrated with TCM.
>> Dropping a table for instance will currently cause problems if nodes are
>> offline.
>> >>
>> >> >>> [3] We have a range of validations we are already performing
>> against cassandra-accord directly, and against its integration with
>> Cassandra in cep-15-accord. We have run hundreds of billions of simulated
>> transactions, and are still discovering some minor fault every few billion
>> simulated transactions or so. There remains a lot more simulated validation
>> to explore, as well as with real clusters serving real workloads.
>> >>
>> >> >>> [4] There are already a range of virtual tables for exploring
>> internal state in Accord, and reasonably good metric support. However,
>> tracing is not yet supported, and our metric and virtual table integrations
>> need some further development.
>> >>
>> >> >>> [5] There are also other edge cases to address such as ensuring we
>> do not reuse HLCs after restart, supporting ByteOrderPartitioner, and live
>> migration from/to Paxos is undergoing fine-tuning and validation; probably
>> there are some other things I am forgetting.
>> >>
>> >> >>>
>> >>
>> >> >>> Altogether the feature is fairly mature, despite these caveats.
>> This is the fruit of the labour of a long list of contributors, including
>> Aleksey Yeschenko, Alex Petrov, Ariel Weisberg, Blake Eggleston, Caleb
>> Rackliffe and David Capwell, and represents a huge undertaking. It also
>> wouldn’t have been possible without the work of Alex Petrov, Marcus
>> Eriksson and Sam Tunnicliffe on delivering transactional cluster metadata.
>> I hope you will join me in thanking them all for their contributions.
>> >>
>> >> >>>
>> >>
>> >> >>> Alex has also kindly produced some initial overview documentation
>> for developers, that can be found here:
>> https://github.com/apache/cassandra/blob/cep-15-accord/doc/modules/cassandra/pages/developing/accord/index.adoc.
>> This will be expanded as time permits.
>> >>
>> >> >>>
>> >>
>> >> >>> Does anyone have any questions or concerns?
>> >>
>> >> >>>
>> >>
>> >> >>>
>> >>
>> >> >>
>> >>
>> >> >>
>> >>
>> >> >> --
>> >>
>> >> >> Dmitry Konstantinov
>> >>
>> >> >
>> >>
>> >> >
>> >>
>> >>
>>
>>
>>
>
