Hi everyone,

To update you: the last patches we considered blockers have landed in the cep-15-accord branch. Caleb has now started rebasing the branch onto trunk. I expect there will be a few failing tests still to resolve at that point, but once they have been squashed we will proceed with the merge.

There remains more work to do before release, and I will publish a detailed roadmap to Jira when I’m back in a couple of weeks.

> On 11 Mar 2025, at 20:12, Nate McCall <zznat...@gmail.com> wrote:
>
> It sounds like we are all pretty interested in seeing this feature land and the branch maintenance is causing overhead that could be spent on finalisation. +1 on merging, particularly given the feature flag work.
>
> Once more unto the breach 💪
>
> On Fri, 7 Mar 2025 at 6:56 PM, Benedict <bened...@apache.org> wrote:
>> There are essentially three possible timelines to choose from here:
>>
>> 1) We agree in the next few days to merge to trunk. We will then prioritise rebasing onto trunk and resolving any pre-merge items starting next week.
>> 2) There’s some more debate and agreement to merge to trunk in a week or two. In the meantime we will shift to internal-first development but we’ll likely prioritise the above work as soon as we can, which may be in a few weeks, so we can shift to trunk first development.
>> 3) We don’t agree to merge accord anytime soon, so we shift to internal-first development for the time being. I’m not sure when we will prioritise any of the above.
>>
>> Our resources are finite and we’ve exhausted them (literally), so it’s pretty much pick one of the above. I don’t really mind which you pick, but I won’t personally be prioritising merge after this third attempt.
>>
>>> On 6 Mar 2025, at 22:01, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>
>>> Hmm... I took a look at the cep-15-accord branch in GitHub, it looks like it's several hundred commits behind trunk. Since you'll need to rebase again before merge *anyways*, would it make sense to do it once more, and I can publish easy-cass-lab with the latest branch? If folks have concerns, it's easy to fire up a cluster (I do it constantly) and try it out.
>>>
>>> I think if we were to do this, out of consideration we should time box the amount of time for an evaluation and unless someone raises an objection, consider lazy consensus achieved.
>>>
>>> Jon
>>>
>>> On Thu, Mar 6, 2025 at 12:46 PM Benedict Elliott Smith <bened...@apache.org> wrote:
>>>> Because we want to validate against the latest code in trunk, else we are validating stale behaviours. The cost of rebasing is high, so we do not do it frequently. That means we will likely stop developing OSS-first, as the focus will have to move to our internal branch that satisfies these criteria.
>>>>
>>>> Exactly what this might be for upstreaming I cannot say. Personally, I aim to work exclusively on the branch we are stabilising. If that is not trunk, the latency for my contributions being made public might be high, as I have a huge imbalance of over-investment to recoup, and anything unnecessary will be deferred.
>>>>
>>>> Since the feature is disabled, and the code is almost entirely isolated, I cannot imagine the cost to the community of removing this work would be very high. But, I do not intend to argue Accord’s case here. I will let you all decide.
>>>>
>>>> Please decide soon though, as it shapes our work planning. The positive reception so far had led me to consider prioritising a move to trunk-first development within the next week or two, and the associated work that entails. However, if that was optimistic we will have to shift our plans.
>>>>
>>>>> On 6 Mar 2025, at 20:16, Jordan West <jw...@apache.org> wrote:
>>>>>
>>>>> The work and effort in accord has been amazing. And I’m sure it sets a new standard for code quality and correctness testing which I’m also entirely behind. I also trust the folks working on it want to take it to a fully production-ready solution. But I’m worried about circumstances out of our control leaving us with a very complex feature that isn’t complete.
>>>>>
>>>>> I do have some questions. Could folks help me better understand why testing real workloads necessitates a merge (my understanding from the original reason is this is the impetus for why we would merge now)? Also I think the performance and schema change caveats are rather large ones. One of Accord’s promises was better performance and I think making schema changes with nodes down not being supported is a big gap. Could we have some criteria like “supports all the operations PaxosV2 supports” or “performs as well or better than PaxosV2 on [workload(s)]”?
>>>>>
>>>>> I understand waiting asks a lot of the authors in terms of bearing the burden of a more complex merge. But I think we also need to consider what merging is asking the community to bear if the worst happens and we are unable to take the feature from its current state to something that can be widely used in production.
>>>>>
>>>>> Jordan
>>>>>
>>>>> On Wed, Mar 5, 2025 at 15:52 Blake Eggleston <bl...@ultrablake.com> wrote:
>>>>>> +1 to merging it
>>>>>>
>>>>>> On Wed, Mar 5, 2025, at 12:22 PM, Patrick McFadin wrote:
>>>>>>> You have my +1
>>>>>>>
>>>>>>> On Wed, Mar 5, 2025 at 12:16 PM Benedict <bened...@apache.org> wrote:
>>>>>>> >
>>>>>>> > Correct, these caveats should only apply to tables that have opted-in to accord.
>>>>>>> >
>>>>>>> > On 5 Mar 2025, at 20:08, Jeremiah Jordan <jerem...@apache.org> wrote:
>>>>>>> >
>>>>>>> > So great to see all this hard work about to pay off!
>>>>>>> >
>>>>>>> > On the questions/concerns front, the only concern I would have towards merging this to trunk is if any of the caveats apply when someone is not using Accord. Assuming they only apply when the feature flag is enabled, I see no reason not to get this merged into trunk once everyone involved is happy with the state of it.
>>>>>>> >
>>>>>>> > -Jeremiah
>>>>>>> >
>>>>>>> > On Mar 5, 2025 at 12:15:23 PM, Benedict Elliott Smith <bened...@apache.org> wrote:
>>>>>>> >>
>>>>>>> >> That depends on all of you lovely people :D
>>>>>>> >>
>>>>>>> >> I think we should have finished merging everything we want before QA by ~Monday; certainly not much later.
>>>>>>> >>
>>>>>>> >> I think we have some upgrade and python dtest failures to address as well.
>>>>>>> >>
>>>>>>> >> So it could be pretty soon if the community is supportive.
>>>>>>> >>
>>>>>>> >> On 5 Mar 2025, at 17:22, Patrick McFadin <pmcfa...@gmail.com> wrote:
>>>>>>> >>
>>>>>>> >> What is the timing for starting the merge process? I'm asking because I have (yet another) presentation and this would be a cool update.
>>>>>>> >>
>>>>>>> >> On Wed, Mar 5, 2025 at 1:22 AM Benedict Elliott Smith <bened...@apache.org> wrote:
>>>>>>> >> >
>>>>>>> >> > Thanks everyone.
>>>>>>> >> >
>>>>>>> >> > Jon - your help will be greatly appreciated. We’ll let you know when we’ve got the cycles to invest in performance work (hopefully fairly soon). I expect the first step will be improving visibility so we can better understand what the system is doing (particularly the caching layers), but we can dig in together when ready.
>>>>>>> >> >
>>>>>>> >> > On 4 Mar 2025, at 18:15, Jon Haddad <j...@rustyrazorblade.com> wrote:
>>>>>>> >> >
>>>>>>> >> > Very exciting!
>>>>>>> >> >
>>>>>>> >> > I have a client that's very interested in Accord, so I should have budget to dig into it, especially on the performance side of things.
>>>>>>> >> >
>>>>>>> >> > Jon
>>>>>>> >> >
>>>>>>> >> > On Tue, Mar 4, 2025 at 9:57 AM Dmitry Konstantinov <netud...@gmail.com> wrote:
>>>>>>> >> >>
>>>>>>> >> >> Thank you to all Accord and TCM contributors, it is really exciting to see a development of such huge and wonderful features moving forward and opening the door to the new Cassandra epoch!
>>>>>>> >> >>
>>>>>>> >> >> On Tue, 4 Mar 2025 at 20:45, Blake Eggleston <bl...@ultrablake.com> wrote:
>>>>>>> >> >>>
>>>>>>> >> >>> Thanks Benedict!
>>>>>>> >> >>>
>>>>>>> >> >>> I’m really excited to see accord reach this milestone, even with these caveats. You seem to have left yourself off the list of contributors though, even though you’ve been a central figure in its development :) So thanks to all accord & tcm contributors, including Benedict, for making this possible!
>>>>>>> >> >>>
>>>>>>> >> >>> On Tue, Mar 4, 2025, at 8:00 AM, Benedict Elliott Smith wrote:
>>>>>>> >> >>>
>>>>>>> >> >>> Hi everyone,
>>>>>>> >> >>>
>>>>>>> >> >>> It’s been exactly 3.5 years since the first commit to cassandra-accord. Yes, really, it’s been that long.
>>>>>>> >> >>>
>>>>>>> >> >>> We will be starting to validate the feature against real workloads in the near future, so we can’t sensibly push off merging much longer. The following is a brief run-down of the state of play. There are no known bugs, but there remain a number of caveats we will be incrementally addressing in the run-up to a full release:
>>>>>>> >> >>>
>>>>>>> >> >>> [1] Accord is likely to be SLOW until further optimisations are implemented
>>>>>>> >> >>> [2] Schema changes have a number of hard edges
>>>>>>> >> >>> [3] Validation is ongoing, so there are likely still a number of bugs to shake out
>>>>>>> >> >>> [4] Many operator visibility/tooling/documentation improvements are pending
>>>>>>> >> >>>
>>>>>>> >> >>> To expand a little:
>>>>>>> >> >>>
>>>>>>> >> >>> [1] As of the last experiment we conducted, accord’s throughput was poor - also leading to higher LAN latencies. We have done no WAN experiments to date, but the protocol guarantees should already achieve better round-trip performance, in particular under contention. Improving throughput will be the main focus of attention once we are satisfied the protocol is otherwise stable, but our focus remains validation for the moment.
>>>>>>> >> >>> [2] Schema changes have not yet been well integrated with TCM. Dropping a table for instance will currently cause problems if nodes are offline.
>>>>>>> >> >>> [3] We have a range of validations we are already performing against cassandra-accord directly, and against its integration with Cassandra in cep-15-accord. We have run hundreds of billions of simulated transactions, and are still discovering some minor fault every few billion simulated transactions or so. There remains a lot more simulated validation to explore, as well as with real clusters serving real workloads.
>>>>>>> >> >>> [4] There are already a range of virtual tables for exploring internal state in Accord, and reasonably good metric support. However, tracing is not yet supported, and our metric and virtual table integrations need some further development.
>>>>>>> >> >>> [5] There are also other edge cases to address such as ensuring we do not reuse HLCs after restart, supporting ByteOrderPartitioner, and live migration from/to Paxos is undergoing fine-tuning and validation; probably there are some other things I am forgetting.
>>>>>>> >> >>>
>>>>>>> >> >>> Altogether the feature is fairly mature, despite these caveats. This is the fruit of the labour of a long list of contributors, including Aleksey Yeschenko, Alex Petrov, Ariel Weisberg, Blake Eggleston, Caleb Rackliffe and David Capwell, and represents a huge undertaking. It also wouldn’t have been possible without the work of Alex Petrov, Marcus Eriksson and Sam Tunnicliffe on delivering transactional cluster metadata. I hope you will join me in thanking them all for their contributions.
>>>>>>> >> >>>
>>>>>>> >> >>> Alex has also kindly produced some initial overview documentation for developers, that can be found here: https://github.com/apache/cassandra/blob/cep-15-accord/doc/modules/cassandra/pages/developing/accord/index.adoc. This will be expanded as time permits.
>>>>>>> >> >>>
>>>>>>> >> >>> Does anyone have any questions or concerns?
>>>>>>> >> >>
>>>>>>> >> >> --
>>>>>>> >> >> Dmitry Konstantinov