I am personally in favor of the idea, and I think some of the finer details can be worked around.
In https://issues.apache.org/jira/browse/CASSANDRA-19769 I added an AST for CQL for tests and to improve our testing, which led me to file several tickets as it found bugs. A large part of me wants to bikeshed, and I have to tell myself that improving CQL != CBO…

> On Sep 19, 2024, at 11:10 AM, Patrick McFadin <pmcfa...@gmail.com> wrote:
>
> Did this get resolved? Is it ready for a VOTE thread?
>
> On Tue, Jan 2, 2024 at 1:41 PM Benedict <bened...@apache.org> wrote:
>> The CEP expressly includes an item for coordinated cardinality estimation,
>> by producing whole cluster summaries. I’m not sure if you addressed this in
>> your feedback; it’s not clear what you’re referring to with distributed
>> estimates, but avoiding this was expressly the driver of my suggestion to
>> instead include the plan as a payload (which offers users some additional
>> facilities).
>>
>>
>>> On 2 Jan 2024, at 21:26, Ariel Weisberg <ar...@weisberg.ws> wrote:
>>>
>>>
>>> Hi,
>>>
>>> I am burying the lede, but it's important to keep an eye on
>>> runtime-adaptive vs planning time optimization as the cost/benefits vary
>>> greatly between the two and runtime adaptive can be a game changer.
>>> Basically CBO optimizes for query efficiency and startup time at the
>>> expense of not handling some queries well, and runtime adaptive is
>>> cheap/free for expensive queries and can handle cases that CBO can't.
>>>
>>> Generally speaking I am +1 on the introduction of a CBO, since it seems
>>> like there exist things that would benefit from it materially (and many of
>>> the associated refactors/cleanup) and it aligns with my north star that
>>> includes joins.
>>>
>>> Do we all have the same north star that Cassandra should eventually support
>>> joins? Just curious if that is controversial.
>>>
>>> I don't feel like this CEP in particular should need to really nail down
>>> exactly how distributed estimates work since we can start with using local
>>> estimates as a proxy for the entire cluster and then improve. If someone
>>> has bandwidth to do a separate CEP for that then sure that would be great,
>>> but this seems big enough in scope already.
>>>
>>> RE testing, continuity of performance of queries is going to be really
>>> important. I would really like to see that we have fuzzed the space
>>> deterministically and via a collection of hand rolled cases, and can
>>> compare performance between versions to catch queries that regress.
>>> Hopefully we can agree on a baseline for releasing where we know what prior
>>> release to compare to and what acceptable changes in performance are.
>>>
>>> RE prepared statements - It feels to me like trying to send the plan blob
>>> back and forth to get more predictable, but not absolutely predictable,
>>> plans is not worth it? Feels like a lot for an incremental improvement over
>>> a baseline that doesn't exist yet, IOW it doesn't feel like something for
>>> V1. Maybe it ends up in YAGNI territory.
>>>
>>> The north star of predictable behavior for queries is a *very* important
>>> one because it means the world to users, but CBO is going to make mistakes
>>> all over the place. It's simply unachievable even with accurate statistics
>>> because it's very hard to tell how predicates will behave on a column.
>>>
>>> This segues nicely into the importance of adaptive execution :-) It's how
>>> you rescue the queries that CBO doesn't handle well for any reason such as
>>> bugs, bad statistics, missing features. Re-ordering predicate evaluation,
>>> switching indexes, and re-ordering joins can all be done on the fly.
>>>
>>> CBO is really a performance optimization since adaptive approaches will
>>> allow any query to complete with some wasted resources.
>>>
>>> If my pager were waking me up at night and I wanted to stem the bleeding I
>>> would reach for runtime adaptive over CBO because I know it will catch more
>>> cases even if it is slower to execute up front.
>>>
>>> What is the nature of the queries we are looking to solve right now? Are
>>> they long running heavy hitters, or short queries that explode if run
>>> incorrectly, or a mix of both?
>>>
>>> Ariel
>>>
>>> On Tue, Dec 12, 2023, at 8:29 AM, Benjamin Lerer wrote:
>>>> Hi everybody,
>>>>
>>>> I would like to open the discussion on the introduction of a cost based
>>>> optimizer to allow Cassandra to pick the best execution plan based on the
>>>> data distribution, thereby improving overall query performance.
>>>>
>>>> This CEP should also lay the groundwork for the future addition of
>>>> features like joins, subqueries, OR/NOT and index ordering.
>>>>
>>>> The proposal is here:
>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-39%3A+Cost+Based+Optimizer
>>>>
>>>> Thank you in advance for your feedback.
>>>