Yeah, avg-distinct is almost certainly NOT the right thing. I don’t have the 
answer. My best advice is to strive for a formula that is simple, deterministic, 
numerically stable, and accurate (a good estimate of the actual selectivity of 
the query). Find a balance of those four factors.
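
To make the compositionality concern quoted below concrete, here is a minimal
sketch. It is not Calcite code: the class SelectivityCombiners and its methods
are hypothetical names used only for illustration, and the estimates are
made-up numbers. It combines the estimates of two groups of rels, once
directly over the union and once by first summarizing each group.

```java
import java.util.Arrays;
import java.util.stream.DoubleStream;

/** Hypothetical helper, for illustration only; not part of Calcite. */
public class SelectivityCombiners {
  /** Largest estimate; the result does not depend on input order. */
  public static double max(double... xs) {
    return Arrays.stream(xs).max().getAsDouble();
  }

  /** Plain arithmetic mean of the estimates. */
  public static double avg(double... xs) {
    return Arrays.stream(xs).average().getAsDouble();
  }

  /** Arithmetic mean over the distinct estimate values only. */
  public static double avgDistinct(double... xs) {
    return Arrays.stream(xs).distinct().average().getAsDouble();
  }

  /** Concatenates two arrays of estimates (the "A U B" of the thread). */
  public static double[] union(double[] a, double[] b) {
    return DoubleStream.concat(Arrays.stream(a), Arrays.stream(b)).toArray();
  }

  public static void main(String[] args) {
    // Made-up selectivity estimates from two groups of rels.
    double[] a = {0.25, 0.25, 0.5};
    double[] b = {0.5};
    double[] all = union(a, b);

    // max composes: summarizing per group first gives the same answer.
    System.out.println(max(all) == max(max(a), max(b)));
    // avg and avg-distinct do not compose on this input.
    System.out.println(avg(all) == avg(avg(a), avg(b)));
    System.out.println(
        avgDistinct(all) == avgDistinct(avgDistinct(a), avgDistinct(b)));
  }
}
```

On this input, max gives the same value whether it is computed over all
estimates or over per-group summaries, while avg and avg-distinct do not, so a
handler that only sees its children's summaries cannot reproduce them.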

> On Jun 1, 2026, at 2:26 PM, Alessandro Solimando 
> <[email protected]> wrote:
> 
> Thanks Julian, approximation error is indeed only one of the dimensions of
> interest, and determinism is probably an even more important property in
> this case.
> 
> If variance were guaranteed to be "limited", then I'd pick either min or
> max, for the reasons you suggest, and select the average of distinct values
> otherwise.
> 
> But I believe avg-distinct is problematic because it's not compositional
> (by which I mean avg(A U B) != avg(avg(A), avg(B)), whereas the equivalent
> identity does hold for min/max), and metadata handlers do need to compose
> statistics from children's stats ("aggregated", so they need to be local).
> 
> Then between min and max it's another hard choice, and it depends on how
> the statistics are consumed and used, which also determines if it's more
> problematic to overestimate or underestimate.
> 
> My (unproven) feeling is that underestimation is more problematic in
> practice, so I'd default to max and let downstream systems override if
> needed. Happy to hear more thoughts on this though, as it's hard to find
> quality references in literature helping with CBO trade-offs in practice.
> 
> It would be interesting to do a sanity check for other metadata handlers
> and see if they support RelSubset or not, and see if they can be grouped by
> desired properties.
> 
> I am currently on PTO but I might take this on if nobody gets to it before
> I do.
> 
> Best regards,
> Alessandro
> 
> On Mon, Jun 1, 2026, 20:37 Julian Hyde <[email protected]> wrote:
> 
>> Yes, it makes sense to use the RelSubset selectivity. I think that’s the
>> main issue here, so let’s declare us at consensus.
>> 
>> How that value is arrived at is a different matter. Let’s strive to create
>> good, understandable, deterministic, numerically stable formulas for
>> statistics. Your statement
>> 
>>> considering that statistics propagation is anyway an
>>> estimation/approximation
>> 
>> isn't helpful if it lets us shrug and accept nondeterministic statistics
>> estimates. A deterministic process for computing metadata is extremely
>> desirable, and I believe it is achievable.
>> 
>> Concretely, if we need to combine several estimates of selectivity, max
>> and min are more stable than avg and median. My hunch is that avg-distinct
>> is more stable than avg, and might be good enough if we have to combine
>> estimates from several sources.
>> 
>>> On May 30, 2026, at 10:47 PM, Alessandro Solimando <
>> [email protected]> wrote:
>>> 
>>> Hi Julian,
>>> Since plans belonging to the same RelSubset are part of the same
>>> equivalence class, I would expect them to share the same selectivity, as
>>> they need to filter exactly the same fraction of rows, right?
>>> 
>>> But the problem is almost certainly "numerically unstable" and the order of
>>> filters matters as we are dealing with floating point arithmetic.
>>> 
>>> If that's correct, and considering that statistics propagation is anyway an
>>> estimation/approximation, it should be reasonable to use the representative
>>> of the equivalence class (possibly via RelSubset::getBestOrOriginal()) for
>>> selectivity estimation.
>>> 
>>> Does that make sense to you?
>>> 
>>> Best regards,
>>> Alessandro
>>> 
>>> 
>>> On Fri, May 29, 2026, 22:31 Julian Hyde <[email protected]> wrote:
>>> 
>>>> I don’t recall any reasons.
>>>> 
>>>> Some metadata are easy because they have an ordering. For example, if
>>>> a predicate holds for one rel in a RelSubset then it applies for all.
>>>> Therefore the RelSubset’s RelMdPredicates value should be the union of the
>>>> predicates of all of its constituent rels.
>>>> 
>>>> (Algebraically, such metadata have a partial ordering, and have an
>>>> operation to combine values to make one value that is greater than either.
>>>> I think that makes them monoids and a semilattice.)
>>>> 
>>>> Selectivity doesn’t have those nice algebraic properties, so maybe we
>>>> didn’t make a decision about “who should win” if there is a disagreement.
>>>> 
>>>> Julian
>>>> 
>>>> 
>>>>> On May 29, 2026, at 2:57 AM, Etienne Pelissier via dev <
>>>> [email protected]> wrote:
>>>>> 
>>>>> My team and I are considering adding a getSelectivity(RelSubset, …)
>>>>> override in our codebase and I'd like to check whether there's a known
>>>>> reason core RelMdSelectivity doesn't do this, i.e. whether we'd be walking
>>>>> into something the project has already considered and decided against.
>>>>> 
>>>>> I checked https://lists.apache.org
>>>>> <
>>>> 
>> https://lists.apache.org/[email protected]:gte=0d:getSelectivity
>>>>> 
>>>>> and https://issues.apache.org
>>>>> <
>>>> 
>> https://issues.apache.org/jira/browse/CALCITE-3298?jql=project%20%3D%20CALCITE%20AND%20text%20~%20getSelectivity
>>>>> 
>>>>> and
>>>>> don't think this subject has already been discussed there.
>>>>> 
>>>>> We're planning this override because during Volcano exploration,
>>>>> mq.getSelectivity(subset, p) for a RelSubset falls to the RelNode catch-all
>>>>> in RelMdSelectivity and returns RelMdUtil.guessSelectivity(predicate), a
>>>>> pure function of the predicate's syntactic shape (per-SqlKind factors
>>>>> multiplied across conjuncts), with no dependency on the underlying RelNode.
>>>>> 
>>>>> The override exists in Apache Flink and Apache Drill, which makes its
>>>>> absence in core feel intentional rather than accidental.
>>>>> 
>>>>> 1. Is the absence of a RelSubset handler in RelMdSelectivity deliberate?
>>>>> 2. Are there pitfalls in the Flink/Drill-style override that we'd be
>>>>> inheriting? Delegating to subset.getBestOrOriginal() seems like the obvious
>>>>> shape, but I want to make sure I'm not missing a known footgun before we
>>>>> ship it.
>>>>> 3. If you've tried this in a Calcite-based engine and hit a problem, I'd
>>>>> love to hear what.
>>>>> 
>>>>> Not asking for any changes in core — just trying to sanity-check our
>>>>> downstream decision before we commit to it.
>>>>> 
>>>>> Thanks,
>>>>> Etienne Pelissier
>>>> 
>>>> 
>> 
>> 
