Yeah, avg-distinct is almost certainly NOT the right thing. I don’t have the answer. My best advice is to strive for a formula that is simple, deterministic, numerically stable, and accurate (a good estimate of the actual selectivity of the query), and to find a balance among those four factors.
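To make the compositionality concern from the quoted message concrete, here is a minimal sketch in plain Java. It is not Calcite code: the class name, the helper methods, and the three selectivity values are all made up for illustration. It combines the same hypothetical estimates once "flat" and once in nested groups (as a metadata handler composing children's stats would), and checks whether the two results agree.

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in Calcite.
class CombineSelectivity {
    // Combine estimates by taking the largest one.
    static double max(double... xs) {
        return Arrays.stream(xs).max().getAsDouble();
    }

    // Combine estimates by taking the arithmetic mean.
    static double avg(double... xs) {
        return Arrays.stream(xs).average().getAsDouble();
    }

    public static void main(String[] args) {
        // Three made-up selectivity estimates for the same predicate.
        double a = 0.25, b = 0.5, c = 1.0;

        // max gives the same answer whether combined flat or in groups.
        boolean maxComposes = max(max(a, b), c) == max(a, b, c);

        // avg does not: avg(avg(a, b), c) is 0.6875, while the flat
        // average of all three values is smaller.
        boolean avgComposes = avg(avg(a, b), c) == avg(a, b, c);

        System.out.println("max composes: " + maxComposes);
        System.out.println("avg composes: " + avgComposes);
    }
}
```

With these values the nested average weights c more heavily than the flat one does, so the result depends on how the inputs happen to be grouped. min and max have no such dependence.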
> On Jun 1, 2026, at 2:26 PM, Alessandro Solimando
> <[email protected]> wrote:
>
> Thanks Julian, approximation error is indeed only one of the dimensions of
> interest, and determinism is probably an even more important property in
> this case.
>
> If variance were guaranteed to be "limited", then I'd pick either min or
> max, for the reasons you suggest, and select the average of distinct values
> otherwise.
>
> But I believe avg-distinct is problematic because it's not compositional
> (by which I mean avg(A U B) != avg(avg(A), avg(B)), whereas equality holds
> for min/max), and metadata handlers do need to compose statistics from
> children's stats ("aggregated", so they need to be local).
>
> Then between min and max it's another hard choice, and it depends on how
> the statistics are consumed and used, which also determines if it's more
> problematic to overestimate or underestimate.
>
> My (unproven) feeling is that underestimation is more problematic in
> practice, so I'd default to max and let downstream systems override if
> needed. Happy to hear more thoughts on this though, as it's hard to find
> quality references in literature helping with CBO trade-offs in practice.
>
> It would be interesting to do a sanity check for other metadata handlers
> and see if they support RelSubset or not, and see if they can be grouped by
> desired properties.
>
> I am currently on PTO but I might take this on if nobody gets to it before
> I do.
>
> Best regards,
> Alessandro
>
> On Mon, Jun 1, 2026, 20:37 Julian Hyde <[email protected]> wrote:
>
>> Yes, it makes sense to use the RelSubset selectivity. I think that’s the
>> main issue here, so let’s declare us at consensus.
>>
>> How that value is arrived at is a different matter. Let’s strive to create
>> good, understandable, deterministic, numerically stable formulas for
>> statistics.
>> Your statement
>>
>>> considering that statistics propagation is anyway an
>>> estimation/approximation
>>
>> isn't helpful if it lets us shrug and accept nondeterministic statistics
>> estimates. A deterministic process for computing metadata is extremely
>> desirable, and I believe it is achievable.
>>
>> Concretely, if we need to combine several estimates of selectivity, max
>> and min are more stable than avg and median. My hunch is that avg-distinct
>> is more stable than avg, and might be good enough if we have to combine
>> estimates from several sources.
>>
>>> On May 30, 2026, at 10:47 PM, Alessandro Solimando
>>> <[email protected]> wrote:
>>>
>>> Hi Julian,
>>> Plans belonging to the same RelSubset being part of the same equivalence
>>> class, I would expect them to share the same selectivity, as they need to
>>> filter exactly the same fraction of rows, right?
>>>
>>> But the problem is almost certainly "numerically unstable" and the order
>>> of filters matters as we are dealing with floating point arithmetic.
>>>
>>> If that's correct, and considering that statistics propagation is anyway
>>> an estimation/approximation, it should be reasonable to use the
>>> representative of the equivalence class (possibly via
>>> RelSubset::getBestOrOriginal()) for selectivity estimation.
>>>
>>> Does that make sense to you?
>>>
>>> Best regards,
>>> Alessandro
>>>
>>>
>>> On Fri, May 29, 2026, 22:31 Julian Hyde <[email protected]> wrote:
>>>
>>>> I don’t recall any reasons.
>>>>
>>>> Some metadata are easy because they have an ordering. For example, if
>>>> a predicate holds for one rel in a RelSubset then it applies for all.
>>>> Therefore the RelSubset’s RelMdPredicates value should be the union of
>>>> the predicates of all of its constituent rels.
>>>>
>>>> (Algebraically, such metadata have a partial ordering, and have an
>>>> operation to combine values to make one value that is greater than
>>>> either.
>>>> I think that makes them monoids and a semilattice.)
>>>>
>>>> Selectivity doesn’t have those nice algebraic properties, so maybe we
>>>> didn’t make a decision about “who should win” if there is a
>>>> disagreement.
>>>>
>>>> Julian
>>>>
>>>>
>>>>> On May 29, 2026, at 2:57 AM, Etienne Pelissier via dev
>>>>> <[email protected]> wrote:
>>>>>
>>>>> Me and my team are considering adding a getSelectivity(RelSubset, …)
>>>>> override in our codebase and I'd like to check whether there's a known
>>>>> reason core RelMdSelectivity doesn't do this — i.e. whether we'd be
>>>>> walking into something the project has already considered and decided
>>>>> against.
>>>>>
>>>>> I checked https://lists.apache.org
>>>>> <https://lists.apache.org/[email protected]:gte=0d:getSelectivity>
>>>>> and https://issues.apache.org
>>>>> <https://issues.apache.org/jira/browse/CALCITE-3298?jql=project%20%3D%20CALCITE%20AND%20text%20~%20getSelectivity>
>>>>> and don't think this subject has already been discussed there.
>>>>>
>>>>> We're planning this override because during Volcano exploration,
>>>>> mq.getSelectivity(subset, p) for a RelSubset falls to the RelNode
>>>>> catch-all in RelMdSelectivity and returns
>>>>> RelMdUtil.guessSelectivity(predicate) — a pure function of the
>>>>> predicate's syntactic shape (per-SqlKind factors multiplied across
>>>>> conjuncts), with no dependency on the underlying RelNode.
>>>>>
>>>>> The override exists in Apache Flink and Apache Drill, which makes its
>>>>> absence in core feel intentional rather than accidental.
>>>>>
>>>>> 1. Is the absence of a RelSubset handler in RelMdSelectivity
>>>>> deliberate?
>>>>> 2. Are there pitfalls in the Flink/Drill-style override that we'd be
>>>>> inheriting? Delegating to subset.getBestOrOriginal() seems like the
>>>>> obvious shape, but I want to make sure I'm not missing a known footgun
>>>>> before we ship it.
>>>>> 3. If you've tried this in a Calcite-based engine and hit a problem,
>>>>> I'd love to hear what.
>>>>>
>>>>> Not asking for any changes in core — just trying to sanity-check our
>>>>> downstream decision before we commit to it.
>>>>>
>>>>> Thanks,
>>>>> Etienne Pelissier
