Re: CEP-15 Update

Paulo Motta Fri, 18 Apr 2025 07:12:37 -0700

Awesome milestone, congrats and thanks to all involved! 👏👏👏

On Fri, 18 Apr 2025 at 05:19 Dmitry Konstantinov <[email protected]> wrote:


> Hooray! Huge thanks to all! Now, I have no more excuses — it's time to try
> it :-D
>
> On Thu, 17 Apr 2025 at 23:42, Jordan West <[email protected]> wrote:
>
>> Congrats all! My previous reservations (that have been addressed) aside,
>> this is an amazing milestone. Awesome, awesome work!
>>
>> Jordan
>>
>> On Thu, Apr 17, 2025 at 15:07 David Capwell <[email protected]> wrote:
>>
>>> I have merged cep-15-accord into trunk.  If you experience any issues
>>> please reach out to me
>>>
>>>
>>> On Apr 17, 2025, at 12:55 AM, Benedict Elliott Smith <
>>> [email protected]> wrote:
>>>
>>> Final update: David has completed a second rebase after we reached
>>> parity with trunk on our CI, and has confirmed tests remain stable. So I
>>> expect CEP-15 to merge to trunk sometime today.
>>>
>>> No doubt there will be some unexpected disruption to others after a
>>> patch like this lands. Reach out via slack if you have any trouble.
>>>
>>> On 16 Mar 2025, at 10:44, Benedict Elliott Smith <[email protected]>
>>> wrote:
>>>
>>> Hi everyone,
>>>
>>> To update you: the last patches we considered blockers have landed in
>>> the cep-15-accord branch. Caleb has now started rebasing the branch onto
>>> trunk. I expect there will be a few failing tests still to resolve at that
>>> point, but once they have been squashed we will proceed with the merge.
>>>
>>> There remains more work to do before release, and I will publish a
>>> detailed roadmap to Jira when I’m back in a couple of weeks.
>>>
>>>
>>> On 11 Mar 2025, at 20:12, Nate McCall <[email protected]> wrote:
>>>
>>> It sounds like we are all pretty interested in seeing this feature land
>>> and the branch maintenance is causing overhead that could be spent on
>>> finalisation. +1 on merging, particularly given the feature flag work.
>>>
>>> Once more unto the breach 💪
>>>
>>> On Fri, 7 Mar 2025 at 6:56 PM, Benedict <[email protected]> wrote:
>>>
>>>> There are essentially three possible timelines to choose from here:
>>>>
>>>> 1) We agree in the next few days to merge to trunk. We will then
>>>> prioritise rebasing onto trunk and resolving any pre-merge items starting
>>>> next week.
>>>> 2) There’s some more debate and agreement to merge to trunk in a week
>>>> or two. In the meantime we will shift to internal-first development but
>>>> we’ll likely prioritise the above work as soon as we can, which may be in a
>>>> few weeks, so we can shift to trunk first development.
>>>> 3) We don’t agree to merge accord anytime soon, so we shift to
>>>> internal-first development for the time being. I’m not sure when we will
>>>> prioritise any of the above.
>>>>
>>>> Our resources are finite and we’ve exhausted them (literally), so it’s
>>>> pretty much pick one of the above. I don’t really mind which you pick, but
>>>> I won’t personally be prioritising merge after this third attempt.
>>>>
>>>> On 6 Mar 2025, at 22:01, Jon Haddad <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> Hmm... I took a look at the cep-15-accord branch in GitHub, it looks
>>>> like it's several hundred commits behind trunk.  Since you'll need to
>>>> rebase again before merge *anyways*, would it make sense to do it once
>>>> more, and I can publish easy-cass-lab with the latest branch?  If folks
>>>> have concerns, it's easy to fire up a cluster (I do it constantly) and try
>>>> it out.
>>>>
>>>> I think if we were to do this, out of consideration we should time box
>>>> the amount of time for an evaluation and unless someone raises an
>>>> objection, consider lazy consensus achieved.
>>>>
>>>> Jon
>>>>
>>>>
>>>>
>>>> On Thu, Mar 6, 2025 at 12:46 PM Benedict Elliott Smith <
>>>> [email protected]> wrote:
>>>>
>>>>> Because we want to validate against the latest code in trunk, else we
>>>>> are validating stale behaviours. The cost of rebasing is high, so we do not
>>>>> do it frequently. That means we will likely stop developing OSS-first, as
>>>>> the focus will have to move to our internal branch that satisfies these
>>>>> criteria.
>>>>>
>>>>> Exactly what this might be for upstreaming I cannot say. Personally, I
>>>>> aim to work exclusively on the branch we are stabilising. If that is not
>>>>> trunk, the latency for my contributions being made public might be high, as
>>>>> I have a huge imbalance of over-investment to recoup, and anything
>>>>> unnecessary will be deferred.
>>>>>
>>>>> Since the feature is disabled, and the code is almost entirely
>>>>> isolated, I cannot imagine the cost to the community of removing this work
>>>>> would be very high. But, I do not intend to argue Accord’s case here. I
>>>>> will let you all decide.
>>>>>
>>>>> Please decide soon though, as it shapes our work planning. The
>>>>> positive reception so far had led me to consider prioritising a move to
>>>>> trunk-first development within the next week or two, and the associated
>>>>> work that entails. However, if that was optimistic we will have to shift
>>>>> our plans.
>>>>>
>>>>>
>>>>>
>>>>> On 6 Mar 2025, at 20:16, Jordan West <[email protected]> wrote:
>>>>>
>>>>> The work and effort in accord has been amazing. And I’m sure it sets a
>>>>> new standard for code quality and correctness testing which I’m also
>>>>> entirely behind. I also trust the folks working on it want to take it to
>>>>> a fully production-ready solution. But I’m worried about circumstances
>>>>> out of our control leaving us with a very complex feature that isn’t
>>>>> complete.
>>>>>
>>>>> I do have some questions. Could folks help me better understand why
>>>>> testing real workloads necessitates a merge (my understanding from the
>>>>> original reason is this is the impetus for why we would merge now)? Also I
>>>>> think the performance and schema change caveats are rather large ones. One
>>>>> of Accord’s promises was better performance, and I think the lack of
>>>>> support for schema changes with nodes down is a big gap. Could we have some
>>>>> criteria like “supports all the operations PaxosV2 supports” or “performs
>>>>> as well or better than PaxosV2 on [workload(s)]”?
>>>>>
>>>>> I understand waiting asks a lot of the authors in terms of bearing the
>>>>> burden of a more complex merge. But I think we also need to consider what
>>>>> merging is asking the community to bear if the worst happens and we are
>>>>> unable to take the feature from its current state to something that can be
>>>>> widely used in production.
>>>>>
>>>>>
>>>>> Jordan
>>>>>
>>>>>
>>>>> On Wed, Mar 5, 2025 at 15:52 Blake Eggleston <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> +1 to merging it
>>>>>>
>>>>>> On Wed, Mar 5, 2025, at 12:22 PM, Patrick McFadin wrote:
>>>>>>
>>>>>> You have my +1
>>>>>>
>>>>>> On Wed, Mar 5, 2025 at 12:16 PM Benedict <[email protected]> wrote:
>>>>>> >
>>>>>> > Correct, these caveats should only apply to tables that have
>>>>>> opted-in to accord.
>>>>>> >
>>>>>> > On 5 Mar 2025, at 20:08, Jeremiah Jordan <[email protected]>
>>>>>> wrote:
>>>>>> >
>>>>>> >
>>>>>> > So great to see all this hard work about to pay off!
>>>>>> >
>>>>>> > On the questions/concerns front, the only concern I would have
>>>>>> towards merging this to trunk is if any of the caveats apply when someone
>>>>>> is not using Accord.  Assuming they only apply when the feature flag is
>>>>>> enabled, I see no reason not to get this merged into trunk once everyone
>>>>>> involved is happy with the state of it.
>>>>>> >
>>>>>> > -Jeremiah
>>>>>> >
>>>>>> > On Mar 5, 2025 at 12:15:23 PM, Benedict Elliott Smith <
>>>>>> [email protected]> wrote:
>>>>>> >>
>>>>>> >> That depends on all of you lovely people :D
>>>>>> >>
>>>>>> >> I think we should have finished merging everything we want before
>>>>>> QA by ~Monday; certainly not much later.
>>>>>> >>
>>>>>> >> I think we have some upgrade and python dtest failures to address
>>>>>> as well.
>>>>>> >>
>>>>>> >> So it could be pretty soon if the community is supportive.
>>>>>> >>
>>>>>> >> On 5 Mar 2025, at 17:22, Patrick McFadin <[email protected]>
>>>>>> wrote:
>>>>>> >>
>>>>>> >>
>>>>>> >> What is the timing for starting the merge process? I'm asking because
>>>>>> >> I have (yet another) presentation and this would be a cool update.
>>>>>> >>
>>>>>> >>
>>>>>> >> On Wed, Mar 5, 2025 at 1:22 AM Benedict Elliott Smith
>>>>>> >>
>>>>>> >> <[email protected]> wrote:
>>>>>> >>
>>>>>> >> >
>>>>>> >>
>>>>>> >> > Thanks everyone.
>>>>>> >>
>>>>>> >> >
>>>>>> >>
>>>>>> >> > Jon - your help will be greatly appreciated. We’ll let you know
>>>>>> when we’ve got the cycles to invest in performance work (hopefully fairly
>>>>>> soon). I expect the first step will be improving visibility so we can
>>>>>> better understand what the system is doing (particularly the caching
>>>>>> layers), but we can dig in together when ready.
>>>>>> >>
>>>>>> >> >
>>>>>> >>
>>>>>> >> > On 4 Mar 2025, at 18:15, Jon Haddad <[email protected]>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> >
>>>>>> >>
>>>>>> >> > Very exciting!
>>>>>> >>
>>>>>> >> >
>>>>>> >>
>>>>>> >> > I have a client that's very interested in Accord, so I should
>>>>>> have budget to dig into it, especially on the performance side of things.
>>>>>> >>
>>>>>> >> >
>>>>>> >>
>>>>>> >> > Jon
>>>>>> >>
>>>>>> >> >
>>>>>> >>
>>>>>> >> > On Tue, Mar 4, 2025 at 9:57 AM Dmitry Konstantinov <
>>>>>> [email protected]> wrote:
>>>>>> >>
>>>>>> >> >>
>>>>>> >>
>>>>>> >> >> Thank you to all Accord and TCM contributors, it is really
>>>>>> exciting to see a development of such huge and wonderful features moving
>>>>>> forward and opening the door to the new Cassandra epoch!
>>>>>> >>
>>>>>> >> >>
>>>>>> >>
>>>>>> >> >> On Tue, 4 Mar 2025 at 20:45, Blake Eggleston <
>>>>>> [email protected]> wrote:
>>>>>> >>
>>>>>> >> >>>
>>>>>> >>
>>>>>> >> >>> Thanks Benedict!
>>>>>> >>
>>>>>> >> >>>
>>>>>> >>
>>>>>> >> >>> I’m really excited to see accord reach this milestone, even
>>>>>> with these caveats. You seem to have left yourself off the list of
>>>>>> contributors though, even though you’ve been a central figure in its
>>>>>> development :) So thanks to all accord & tcm contributors, including
>>>>>> Benedict, for making this possible!
>>>>>> >>
>>>>>> >> >>>
>>>>>> >>
>>>>>> >> >>> On Tue, Mar 4, 2025, at 8:00 AM, Benedict Elliott Smith wrote:
>>>>>> >>
>>>>>> >> >>>
>>>>>> >>
>>>>>> >> >>> Hi everyone,
>>>>>> >>
>>>>>> >> >>>
>>>>>> >>
>>>>>> >> >>> It’s been exactly 3.5 years since the first commit to
>>>>>> cassandra-accord. Yes, really, it’s been that long.
>>>>>> >>
>>>>>> >> >>>
>>>>>> >>
>>>>>> >> >>> We will be starting to validate the feature against real
>>>>>> workloads in the near future, so we can’t sensibly push off merging much
>>>>>> longer. The following is a brief run-down of the state of play. There are
>>>>>> no known bugs, but there remain a number of caveats we will be
>>>>>> incrementally addressing in the run-up to a full release:
>>>>>> >>
>>>>>> >> >>>
>>>>>> >>
>>>>>> >> >>> [1] Accord is likely to be SLOW until further optimisations
>>>>>> are implemented
>>>>>> >>
>>>>>> >> >>> [2] Schema changes have a number of hard edges
>>>>>> >>
>>>>>> >> >>> [3] Validation is ongoing, so there are likely still a number
>>>>>> of bugs to shake out
>>>>>> >>
>>>>>> >> >>> [4] Many operator visibility/tooling/documentation
>>>>>> improvements are pending
>>>>>> >>
>>>>>> >> >>>
>>>>>> >>
>>>>>> >> >>> To expand a little:
>>>>>> >>
>>>>>> >> >>>
>>>>>> >>
>>>>>> >> >>> [1] As of the last experiment we conducted, accord’s
>>>>>> throughput was poor - also leading to higher LAN latencies. We have done no
>>>>>> WAN experiments to date, but the protocol guarantees should already achieve
>>>>>> better round-trip performance, in particular under contention. Improving
>>>>>> throughput will be the main focus of attention once we are satisfied the
>>>>>> protocol is otherwise stable, but our focus remains validation for the
>>>>>> moment.
>>>>>> >>
>>>>>> >> >>> [2] Schema changes have not yet been well integrated with TCM.
>>>>>> Dropping a table for instance will currently cause problems if nodes are
>>>>>> offline.
>>>>>> >>
>>>>>> >> >>> [3] We have a range of validations we are already performing
>>>>>> against cassandra-accord directly, and against its integration with
>>>>>> Cassandra in cep-15-accord. We have run hundreds of billions of simulated
>>>>>> transactions, and are still discovering some minor fault every few billion
>>>>>> simulated transactions or so. There remains a lot more simulated validation
>>>>>> to explore, as well as with real clusters serving real workloads.
>>>>>> >>
>>>>>> >> >>> [4] There are already a range of virtual tables for exploring
>>>>>> internal state in Accord, and reasonably good metric support. However,
>>>>>> tracing is not yet supported, and our metric and virtual table integrations
>>>>>> need some further development.
>>>>>> >>
>>>>>> >> >>> [5] There are also other edge cases to address, such as
>>>>>> ensuring we do not reuse HLCs after restart and supporting
>>>>>> ByteOrderPartitioner. Live migration from/to Paxos is also undergoing
>>>>>> fine-tuning and validation; probably there are some other things I am
>>>>>> forgetting.
>>>>>> >>
>>>>>> >> >>>
>>>>>> >>
>>>>>> >> >>> Altogether the feature is fairly mature, despite these
>>>>>> caveats. This is the fruit of the labour of a long list of contributors,
>>>>>> including Aleksey Yeschenko, Alex Petrov, Ariel Weisberg, Blake Eggleston,
>>>>>> Caleb Rackliffe and David Capwell, and represents a huge undertaking. It
>>>>>> also wouldn’t have been possible without the work of Alex Petrov, Marcus
>>>>>> Eriksson and Sam Tunnicliffe on delivering transactional cluster metadata.
>>>>>> I hope you will join me in thanking them all for their contributions.
>>>>>> >>
>>>>>> >> >>>
>>>>>> >>
>>>>>> >> >>> Alex has also kindly produced some initial overview
>>>>>> documentation for developers, that can be found here:
>>>>>> https://github.com/apache/cassandra/blob/cep-15-accord/doc/modules/cassandra/pages/developing/accord/index.adoc.
>>>>>> This will be expanded as time permits.
>>>>>> >>
>>>>>> >> >>>
>>>>>> >>
>>>>>> >> >>> Does anyone have any questions or concerns?
>>>>>> >>
>>>>>> >> >>>
>>>>>> >>
>>>>>> >> >>>
>>>>>> >>
>>>>>> >> >>
>>>>>> >>
>>>>>> >> >>
>>>>>> >>
>>>>>> >> >> --
>>>>>> >>
>>>>>> >> >> Dmitry Konstantinov
>>>>>> >>
>>>>>> >> >
>>>>>> >>
>>>>>> >> >
>>>>>> >>
>>>>>> >>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>>
>
> --
> Dmitry Konstantinov
>
