Thanks Julian, approximation error is indeed only one of the dimensions of
interest, and determinism is probably an even more important property in
this case.
If variance were guaranteed to be "limited", then I'd pick either min or
max, for the reasons you suggest, and select the average of distinct values
otherwise.
But I believe avg-distinct is problematic because it's not compositional:
in general avg(A U B) != avg(avg(A), avg(B)), whereas the corresponding
equality does hold for min and max. Metadata handlers do need to compose
statistics from their children's stats (the statistics are "aggregated",
so the computation needs to be local).
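To make the compositionality point concrete, here is a minimal sketch in plain Python (not Calcite code). The helper name avg_distinct and the sample selectivities are made up for illustration; A and B stand for the estimates seen by two children.

```python
def avg_distinct(values):
    """Average of the distinct values (hypothetical helper, not a Calcite API)."""
    distinct = set(values)
    return sum(distinct) / len(distinct)

# Made-up selectivity estimates held by two children.
a = [0.25, 0.75]
b = [1.0]

# Computed over all estimates at once vs. composed from per-child results.
avg_direct = avg_distinct(a + b)                                  # 2/3
avg_composed = avg_distinct([avg_distinct(a), avg_distinct(b)])   # 0.75

max_direct = max(a + b)                # 1.0
max_composed = max(max(a), max(b))     # 1.0

print(avg_direct == avg_composed)  # False: avg-distinct does not compose
print(max_direct == max_composed)  # True: max composes
```

The same check with min in place of max also yields equal results, which is why a parent can compute min/max from its children's values alone.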
Then between min and max it's another hard choice, and it depends on how
the statistics are consumed and used, which also determines if it's more
problematic to overestimate or underestimate.
My (unproven) feeling is that underestimation is more problematic in
practice, so I'd default to max and let downstream systems override if
needed. Happy to hear more thoughts on this though, as it's hard to find
quality references in the literature that help with CBO trade-offs in
practice.
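As a rough sketch of "default to max, let downstream override", again in plain Python rather than Calcite code: the function combine_selectivity and its combiner parameter are hypothetical names, not an existing API. It also shows why max is attractive for determinism: it does not depend on the order in which the estimates are visited, while a naive floating-point average can.

```python
def combine_selectivity(estimates, combiner=max):
    """Combine several selectivity estimates (hypothetical helper).

    Defaults to max; a downstream system could pass combiner=min instead.
    """
    return combiner(estimates)

def naive_avg(estimates):
    """Left-to-right floating-point average, sensitive to input order."""
    total = 0.0
    for e in estimates:
        total += e
    return total / len(estimates)

one_order = [0.1, 0.2, 0.3]
other_order = [0.3, 0.2, 0.1]

# max (and min) give the same answer whatever the visiting order.
print(combine_selectivity(one_order) == combine_selectivity(other_order))  # True
print(combine_selectivity(one_order, combiner=min))                        # 0.1

# The naive average differs in the last bits between the two orders.
print(naive_avg(one_order) == naive_avg(other_order))                      # False
```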
It would be interesting to do a sanity check of the other metadata handlers
to see whether they support RelSubset, and whether they can be grouped by
desired properties.
I am currently on PTO but I might take this on if nobody gets to it before
I do.
Best regards,
Alessandro
On Mon, Jun 1, 2026, 20:37 Julian Hyde <[email protected]> wrote:
> Yes, it makes sense to use the RelSubset selectivity. I think that’s the
> main issue here, so let’s declare us at consensus.
>
> How that value is arrived at is a different matter. Let’s strive to create
> good, understandable, deterministic, numerically stable formulas for
> statistics. Your statement
>
> > considering that statistics propagation is anyway an
> > estimation/approximation
>
> isn't helpful if it lets us shrug and accept nondeterministic statistics
> estimates. A deterministic process for computing metadata is extremely
> desirable, and I believe it is achievable.
>
> Concretely, if we need to combine several estimates of selectivity, max
> and min are more stable than avg and median. My hunch is that avg-distinct
> is more stable than avg, and might be good enough if we have to combine
> estimates from several sources.
>
> > On May 30, 2026, at 10:47 PM, Alessandro Solimando <
> [email protected]> wrote:
> >
> > Hi Julian,
> > Since plans belonging to the same RelSubset are part of the same
> > equivalence class, I would expect them to share the same selectivity, as
> > they need to filter exactly the same fraction of rows, right?
> >
> > But the problem is almost certainly "numerically unstable" and the order
> > of filters matters, as we are dealing with floating point arithmetic.
> >
> > If that's correct, and considering that statistics propagation is anyway
> > an estimation/approximation, it should be reasonable to use the
> > representative of the equivalence class (possibly via
> > RelSubset::getBestOrOriginal()) for selectivity estimation.
> >
> > Does that make sense to you?
> >
> > Best regards,
> > Alessandro
> >
> >
> > On Fri, May 29, 2026, 22:31 Julian Hyde <[email protected]> wrote:
> >
> >> I don’t recall any reasons.
> >>
> >> Some metadata are easy because they have an ordering. For example, if
> >> a predicate holds for one rel in a RelSubset then it applies for all.
> >> Therefore the RelSubset’s RelMdPredicates value should be the union of
> >> the predicates of all of its constituent rels.
> >>
> >> (Algebraically, such metadata have a partial ordering, and have an
> >> operation to combine values to make one value that is greater than
> >> either. I think that makes them monoids and a semilattice.)
> >>
> >> Selectivity doesn’t have those nice algebraic properties, so maybe we
> >> didn’t make a decision about “who should win” if there is a
> >> disagreement.
> >>
> >> Julian
> >>
> >>
> >>> On May 29, 2026, at 2:57 AM, Etienne Pelissier via dev <
> >> [email protected]> wrote:
> >>>
> >>> My team and I are considering adding a getSelectivity(RelSubset, …)
> >>> override in our codebase and I'd like to check whether there's a known
> >>> reason core RelMdSelectivity doesn't do this, i.e. whether we'd be
> >>> walking into something the project has already considered and decided
> >>> against.
> >>>
> >>> I checked https://lists.apache.org
> >>> <https://lists.apache.org/[email protected]:gte=0d:getSelectivity>
> >>> and https://issues.apache.org
> >>> <https://issues.apache.org/jira/browse/CALCITE-3298?jql=project%20%3D%20CALCITE%20AND%20text%20~%20getSelectivity>
> >>> and don't think this subject has already been discussed there.
> >>>
> >>> We're planning this override because during Volcano exploration,
> >>> mq.getSelectivity(subset, p) for a RelSubset falls to the RelNode
> >>> catch-all in RelMdSelectivity and returns
> >>> RelMdUtil.guessSelectivity(predicate), a pure function of the
> >>> predicate's syntactic shape (per-SqlKind factors multiplied across
> >>> conjuncts), with no dependency on the underlying RelNode.
> >>>
> >>> The override exists in Apache Flink and Apache Drill, which makes its
> >>> absence in core feel intentional rather than accidental.
> >>>
> >>> 1. Is the absence of a RelSubset handler in RelMdSelectivity
> >>> deliberate?
> >>> 2. Are there pitfalls in the Flink/Drill-style override that we'd be
> >>> inheriting? Delegating to subset.getBestOrOriginal() seems like the
> >>> obvious shape, but I want to make sure I'm not missing a known footgun
> >>> before we ship it.
> >>> 3. If you've tried this in a Calcite-based engine and hit a problem,
> >>> I'd love to hear what.
> >>>
> >>> Not asking for any changes in core — just trying to sanity-check our
> >>> downstream decision before we commit to it.
> >>>
> >>> Thanks,
> >>> Etienne Pelissier
> >>
> >>
>
>