Merging is certainly not blocked on my account. Benedict, I wouldn’t describe myself as disappointed. It’s awesome work and I’ve tried to acknowledge the amazing correctness testing that’s been done. I think we should have a high bar for big changes like this, and I was curious about how we will address some of the issues that concern me. I would personally have liked to see a bit more written on how, but we don’t currently have a good structure for my ask, and I recognize that.
Jordan

On Mon, Mar 10, 2025 at 06:01 Alex Petrov <al...@coffeenco.de> wrote:

> While I agree that time spent working on a feature is not necessarily a
> clear indicator of maturity, one can judge the scope of work and thought
> that went into Accord by both its separate repository, and the working
> branch.
>
> I think that merging/accepting SASI was not a mistake. There were several
> efforts to make it work, and back in 2016 we could've made it quite viable
> with just CASSANDRA-11990 and a lot of testing. It did get superseded by
> SAI, but I can imagine a universe where SASI would have been developed into
> a stable feature.
>
> > is there a known path forward to fix the drop schema w nodes down issue
> > and anything written on it?
> Yes, there is a clear known path for fixing schema changes, and gladly
> they do not require a protocol change, just a slightly deeper integration
> with TCM.
>
> On Fri, Mar 7, 2025, at 4:44 PM, Jordan West wrote:
>
> I would love to have my questions answered and see some graphs. I don’t
> think those are unreasonable asks nor do they take away from the awesome
> work done. I was suggesting 1-2 weeks for folks to have the opportunity to
> produce that data if the original authors didn’t have time. I also don’t
> think that’s unreasonable. But to be clear I’m not blocking anything. If
> folks want to merge I am not objecting.
>
> I do think we should hold features to a high standard and personally “time
> worked on a feature” is not a criterion for me when considering why we
> should merge. It is absolutely worth recognizing and celebrating the
> massive investment and effort made here. It’s just an orthogonal point to me.
> As a contrived example: If 15452 was not as impactful performance wise
> after a year of on and off work I would’ve happily continued to address it
> or take a different approach.
> SASI took a year and a half or more and I
> still regret that we merged it into 3.x in the form we did using the same
> early contribution model. That was an example of an extreme, and out of our
> control case, of an entire team disbanding right after merge.
>
> Jordan
>
> On Fri, Mar 7, 2025 at 06:28 Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> I defer to the judgement of the folks that are most impacted by it - ones
> that are in the code, working on the next release. If you all think it's
> good to merge, then I am 100% in support of it. I suspect merging will
> help get it out faster, and I don't see any future in which we don't ship
> this in the next release.
>
> I will be happy to help answer the "how does it compare to paxos v2"
> question post-merge.
>
> Jon
>
> On Fri, Mar 7, 2025 at 5:52 AM Josh McKenzie <jmcken...@apache.org> wrote:
>
> 3.5 years is an incredible amount of time and work; it really is
> significant and thanks to everyone involved for the investment of time and
> energy.
>
> We have a rocky history with large, disruptive contributions in the past
> that have either blocked forward progress post-merge (CASSANDRA-8099), or
> lingered in the code-base increasing maintenance burden on other
> contributors for minimal or no user benefit (early open post SSD
> transition, witness replicas, materialized views). I'm sympathetic to where
> Jordan's questions stem from, as our history of leaving things in the
> codebase long after they've become vestigial or abandoned has slowed down
> our collective momentum maintaining the project on actively used features.
>
> That said, I don't think Accord will run afoul of some of those same
> patterns.
> Aside from the degree of investment already in it and sheer
> number of pmc members and committers involved, I believe it's a feature
> that's universally impactful and that if we had a metaphorical bus-factor
> change (entire group of people working on it disappeared the day after
> merge or decided to go on vacation for 5 years), others in the community
> would be willing to pick things up and keep it moving given its proximity
> to release readiness.
>
> The 2 questions Jordan asked resonate with me: 1) do we have line of sight
> to a fix on the schema issues, and I'll take the liberty of reframing 2) do
> we have line of sight to improvement on the performance front to be usable
> for multi-key transactions? (subtle: I don't think "parity with PaxosV2" is
> the right target, but rather "fast enough to be usable for multi-key
> transactions" since it's a new query paradigm).
>
> Given the context on contributor backing and if the answer is yes to those
> 2 questions (which I believe it is), I think we should generally be
> comfortable with merging the feature as experimental at this time.
>
> On Fri, Mar 7, 2025, at 12:54 AM, Benedict wrote:
>
> There are essentially three possible timelines to choose from here:
>
> 1) We agree in the next few days to merge to trunk. We will then
> prioritise rebasing onto trunk and resolving any pre-merge items starting
> next week.
> 2) There’s some more debate and agreement to merge to trunk in a week or
> two. In the meantime we will shift to internal-first development but we’ll
> likely prioritise the above work as soon as we can, which may be in a few
> weeks, so we can shift to trunk-first development.
> 3) We don’t agree to merge accord anytime soon, so we shift to
> internal-first development for the time being. I’m not sure when we will
> prioritise any of the above.
>
> Our resources are finite and we’ve exhausted them (literally), so it’s
> pretty much pick one of the above.
> I don’t really mind which you pick, but
> I won’t personally be prioritising merge after this third attempt.
>
> On 6 Mar 2025, at 22:01, Jon Haddad <j...@rustyrazorblade.com> wrote:
>
> Hmm... I took a look at the cep-15-accord branch in GitHub, it looks like
> it's several hundred commits behind trunk. Since you'll need to rebase
> again before merge *anyways*, would it make sense to do it once more, and I
> can publish easy-cass-lab with the latest branch? If folks have concerns,
> it's easy to fire up a cluster (I do it constantly) and try it out.
>
> I think if we were to do this, out of consideration we should time box the
> amount of time for an evaluation and unless someone raises an objection,
> consider lazy consensus achieved.
>
> Jon
>
> On Thu, Mar 6, 2025 at 12:46 PM Benedict Elliott Smith <bened...@apache.org> wrote:
>
> Because we want to validate against the latest code in trunk, else we are
> validating stale behaviours. The cost of rebasing is high, so we do not do
> it frequently. That means we will likely stop developing OSS-first, as the
> focus will have to move to our internal branch that satisfies these
> criteria.
>
> Exactly what this might be for upstreaming I cannot say. Personally, I aim
> to work exclusively on the branch we are stabilising. If that is not trunk,
> the latency for my contributions being made public might be high, as I have
> a huge imbalance of over-investment to recoup, and anything unnecessary
> will be deferred.
>
> Since the feature is disabled, and the code is almost entirely isolated, I
> cannot imagine the cost to the community of removing this work would be
> very high. But, I do not intend to argue Accord’s case here. I will let you
> all decide.
>
> Please decide soon though, as it shapes our work planning. The positive
> reception so far had led me to consider prioritising a move to trunk-first
> development within the next week or two, and the associated work that
> entails.
> However, if that was optimistic we will have to shift our plans.
>
> On 6 Mar 2025, at 20:16, Jordan West <jw...@apache.org> wrote:
>
> The work and effort in accord has been amazing. And I’m sure it sets a new
> standard for code quality and correctness testing which I’m also entirely
> behind. I also trust the folks working on it want to take it to a fully
> production ready solution. But I’m worried about circumstances out of our
> control leaving us with a very complex feature that isn’t complete.
>
> I do have some questions. Could folks help me better understand why
> testing real workloads necessitates a merge (my understanding from the
> original reason is this is the impetus for why we would merge now)? Also I
> think the performance and schema change caveats are rather large ones. One
> of accord’s promises was better performance and I think making schema changes
> with nodes down not being supported is a big gap. Could we have some
> criteria like “supports all the operations PaxosV2 supports” or “performs
> as well or better than PaxosV2 on [workload(s)]”?
>
> I understand waiting asks a lot of the authors in terms of bearing the
> burden of a more complex merge. But I think we also need to consider what
> merging is asking the community to bear if the worst happens and we are
> unable to take the feature from its current state to something that can be
> widely used in production.
>
> Jordan
>
> On Wed, Mar 5, 2025 at 15:52 Blake Eggleston <bl...@ultrablake.com> wrote:
>
> +1 to merging it
>
> On Wed, Mar 5, 2025, at 12:22 PM, Patrick McFadin wrote:
>
> You have my +1
>
> On Wed, Mar 5, 2025 at 12:16 PM Benedict <bened...@apache.org> wrote:
> >
> > Correct, these caveats should only apply to tables that have opted-in to
> > accord.
> >
> > On 5 Mar 2025, at 20:08, Jeremiah Jordan <jerem...@apache.org> wrote:
> >
> > So great to see all this hard work about to pay off!
> > > > On the questions/concerns front, the only concern I would have towards > merging this to trunk is if any of the caveats apply when someone is not > using Accord. Assuming they only apply when the feature flag is enabled, I > see no reason not to get this merged into trunk once everyone involved is > happy with the state of it. > > > > -Jeremiah > > > > On Mar 5, 2025 at 12:15:23 PM, Benedict Elliott Smith < > bened...@apache.org> wrote: > >> > >> That depends on all of you lovely people :D > >> > >> I think we should have finished merging everything we want before QA by > ~Monday; certainly not much later. > >> > >> I think we have some upgrade and python dtest failures to address as > well. > >> > >> So it could be pretty soon if the community is supportive. > >> > >> On 5 Mar 2025, at 17:22, Patrick McFadin <pmcfa...@gmail.com> wrote: > >> > >> > >> What is the timing for starting the merge process? I'm asking because > >> > >> I have (yet another) presentation and this would be a cool update. > >> > >> > >> On Wed, Mar 5, 2025 at 1:22 AM Benedict Elliott Smith > >> > >> <bened...@apache.org> wrote: > >> > >> > > >> > >> > Thanks everyone. > >> > >> > > >> > >> > Jon - your help will be greatly appreciated. We’ll let you know when > we’ve got the cycles to invest in performance work (hopefully fairly soon). > I expect the first step will be improving visibility so we can better > understand what the system is doing (particularly the caching layers), but > we can dig in together when ready. > >> > >> > > >> > >> > On 4 Mar 2025, at 18:15, Jon Haddad <j...@rustyrazorblade.com> wrote: > >> > >> > > >> > >> > Very exciting! > >> > >> > > >> > >> > I have a client that's very interested in Accord, so I should have > budget to dig into it, especially on the performance side of things. 
> >> > >> > > >> > >> > Jon > >> > >> > > >> > >> > On Tue, Mar 4, 2025 at 9:57 AM Dmitry Konstantinov < > netud...@gmail.com> wrote: > >> > >> >> > >> > >> >> Thank you to all Accord and TCM contributors, it is really exciting > to see a development of such huge and wonderful features moving forward and > opening the door to the new Cassandra epoch! > >> > >> >> > >> > >> >> On Tue, 4 Mar 2025 at 20:45, Blake Eggleston <bl...@ultrablake.com> > wrote: > >> > >> >>> > >> > >> >>> Thanks Benedict! > >> > >> >>> > >> > >> >>> I’m really excited to see accord reach this milestone, even with > these caveats. You seem to have left yourself off the list of contributors > though, even though you’ve been a central figure in its development :) So > thanks to all accord & tcm contributors, including Benedict, for making > this possible! > >> > >> >>> > >> > >> >>> On Tue, Mar 4, 2025, at 8:00 AM, Benedict Elliott Smith wrote: > >> > >> >>> > >> > >> >>> Hi everyone, > >> > >> >>> > >> > >> >>> It’s been exactly 3.5 years since the first commit to > cassandra-accord. Yes, really, it’s been that long. > >> > >> >>> > >> > >> >>> We will be starting to validate the feature against real workloads > in the near future, so we can’t sensibly push off merging much longer. The > following is a brief run-down of the state of play. 
There are no known > bugs, but there remain a number of caveats we will be incrementally > addressing in the run-up to a full release: > >> > >> >>> > >> > >> >>> [1] Accord is likely to be SLOW until further optimisations are > implemented > >> > >> >>> [2] Schema changes have a number of hard edges > >> > >> >>> [3] Validation is ongoing, so there are likely still a number of > bugs to shake out > >> > >> >>> [4] Many operator visibility/tooling/documentation improvements are > pending > >> > >> >>> > >> > >> >>> To expand a little: > >> > >> >>> > >> > >> >>> [1] As of the last experiment we conducted, accord’s throughput was > poor - also leading to higher LAN latencies. We have done no WAN > experiments to date, but the protocol guarantees should already achieve > better round-trip performance, in particular under contention. Improving > throughput will be the main focus of attention once we are satisfied the > protocol is otherwise stable, but our focus remains validation for the > moment. > >> > >> >>> [2] Schema changes have not yet been well integrated with TCM. > Dropping a table for instance will currently cause problems if nodes are > offline. > >> > >> >>> [3] We have a range of validations we are already performing > against cassandra-accord directly, and against its integration with > Cassandra in cep-15-accord. We have run hundreds of billions of simulated > transactions, and are still discovering some minor fault every few billion > simulated transactions or so. There remains a lot more simulated validation > to explore, as well as with real clusters serving real workloads. > >> > >> >>> [4] There are already a range of virtual tables for exploring > internal state in Accord, and reasonably good metric support. However, > tracing is not yet supported, and our metric and virtual table integrations > need some further development. 
> >> > >> >>> [5] There are also other edge cases to address such as ensuring we > do not reuse HLCs after restart, supporting ByteOrderPartitioner, and live > migration from/to Paxos is undergoing fine-tuning and validation; probably > there are some other things I am forgetting. > >> > >> >>> > >> > >> >>> Altogether the feature is fairly mature, despite these caveats. > This is the fruit of the labour of a long list of contributors, including > Aleksey Yeschenko, Alex Petrov, Ariel Weisberg, Blake Eggleston, Caleb > Rackliffe and David Capwell, and represents a huge undertaking. It also > wouldn’t have been possible without the work of Alex Petrov, Marcus > Eriksson and Sam Tunnicliffe on delivering transactional cluster metadata. > I hope you will join me in thanking them all for their contributions. > >> > >> >>> > >> > >> >>> Alex has also kindly produced some initial overview documentation > for developers, that can be found here: > https://github.com/apache/cassandra/blob/cep-15-accord/doc/modules/cassandra/pages/developing/accord/index.adoc. > This will be expanded as time permits. > >> > >> >>> > >> > >> >>> Does anyone have any questions or concerns? > >> > >> >>> > >> > >> >>> > >> > >> >> > >> > >> >> > >> > >> >> -- > >> > >> >> Dmitry Konstantinov > >> > >> > > >> > >> > > >> > >> > > > > >