Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [datafusion]

via GitHub Fri, 10 Oct 2025 05:10:17 -0700


crepererum commented on issue #8078:
URL: https://github.com/apache/datafusion/issues/8078#issuecomment-3389753573

> I can't help but feel the current Distribution doesn't have many practical
benefits -- specifically the idea of having mathematical descriptions of value
distributions is intellectually appealing, but I have never seen actual query
engines use it (because real data is never completely described by those
theoretical distributions). Maybe I am missing something

FWIW I do agree with this. Take the range of values, for example. Currently
that's two separate stat values, `min` and `max`, but it should probably be
encapsulated in one struct/enum. For `Distribution<ScalarValue>`, it's
unlikely that you're ever going to use anything other than `Generic`, because
Parquet (like most other data sources) really only gives us a range with
inclusive or exclusive bounds. So the enum is mostly unused.
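
To make the "one struct" idea concrete, here is a minimal std-only sketch. The `ValueRange` type and its `contains` method are hypothetical names chosen for illustration; they are not part of DataFusion, and a real version would presumably be built on `ScalarValue` rather than a generic `T`.

```rust
use std::ops::Bound;

/// Hypothetical replacement for separate `min` / `max` statistics:
/// one value carrying both ends and their inclusive/exclusive flavour.
#[derive(Debug, Clone, PartialEq)]
struct ValueRange<T> {
    lower: Bound<T>,
    upper: Bound<T>,
}

impl<T: PartialOrd> ValueRange<T> {
    fn new(lower: Bound<T>, upper: Bound<T>) -> Self {
        Self { lower, upper }
    }

    /// Returns true if `value` lies within the range.
    fn contains(&self, value: &T) -> bool {
        let above_lower = match &self.lower {
            Bound::Included(lo) => value >= lo,
            Bound::Excluded(lo) => value > lo,
            Bound::Unbounded => true,
        };
        let below_upper = match &self.upper {
            Bound::Included(hi) => value <= hi,
            Bound::Excluded(hi) => value < hi,
            Bound::Unbounded => true,
        };
        above_lower && below_upper
    }
}
```

With this shape, `ValueRange::new(Bound::Included(1), Bound::Excluded(10))` describes the half-open range from 1 up to but not including 10, and a missing `min` or `max` is simply `Bound::Unbounded`.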

Then if we look at
[`GenericDistribution`](https://docs.rs/datafusion/50.0.0/datafusion/logical_expr/statistics/struct.GenericDistribution.html)
and its
[constructor](https://docs.rs/datafusion/50.0.0/datafusion/logical_expr/statistics/enum.Distribution.html#method.new_generic),
the issue is again that it requires knowledge like variance, median, and mean,
which you're likely never going to have for most data sources. In fact, if you
have any filtered data source, then calculating the `median` is virtually
impossible if you want to do anything that is remotely performant. So that's
another 75% of the interface gone/unusable.

So what's kinda left is the
[`Interval`](https://docs.rs/datafusion/50.0.0/datafusion/logical_expr/interval_arithmetic/struct.Interval.html)
type and the kinda nice API methods around it. So maybe we could use that?

I also feel that there's a slight conflict of interest or at least two camps
here:

- **statistics always-correct optimizers:** Some people use statistics for
optimizers like join ordering. There, wrong statistics at worst result in
slower execution, but never in wrong results. That is kinda reflected in a lot
of the statistics calculations in the DF code base.
- **correctness:** Some plan transformers (InfluxData for example has one)
rely on statistics that can actually make hard promises, i.e. "all values
are FOR SURE in this range". In that case, you really wanna be picky about what
the stats do.
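
A rough sketch of how the two camps could be told apart in the type system. Everything here (`RangeStat`, `can_prune_gt`, `selectivity_gt`) is a hypothetical illustration, not an existing DataFusion API, and the uniform-distribution assumption in the selectivity estimate is mine:

```rust
/// Hypothetical range statistic that records how far it can be trusted.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RangeStat {
    /// Hard promise: every value is FOR SURE in `[min, max]`.
    Guaranteed { min: i64, max: i64 },
    /// Best effort: values may fall outside `[min, max]`.
    Estimated { min: i64, max: i64 },
}

impl RangeStat {
    /// Correctness-sensitive use: may the input be skipped entirely for the
    /// filter `col > threshold`? Only a hard guarantee justifies that.
    fn can_prune_gt(&self, threshold: i64) -> bool {
        match *self {
            RangeStat::Guaranteed { max, .. } => max <= threshold,
            RangeStat::Estimated { .. } => false,
        }
    }

    /// Cost-model use: rough fraction of rows matching `col > threshold`,
    /// assuming (for illustration only) uniformly distributed values.
    /// A wrong answer here only leads to a slower plan.
    fn selectivity_gt(&self, threshold: i64) -> f64 {
        let (min, max) = match *self {
            RangeStat::Guaranteed { min, max } | RangeStat::Estimated { min, max } => (min, max),
        };
        if threshold >= max {
            return 0.0;
        }
        if threshold < min {
            return 1.0;
        }
        // Here min <= threshold < max, so the denominator is positive.
        (max as f64 - threshold as f64) / (max as f64 - min as f64)
    }
}
```

The point of the split is that a join-ordering heuristic can call `selectivity_gt` on either variant, while a plan transformer that changes results would only act on `Guaranteed`.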

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

