crepererum commented on issue #8078: URL: https://github.com/apache/datafusion/issues/8078#issuecomment-3389753573
> I can't help but feel the current Distribution doesn't have many practical benefits -- specifically the idea of having mathematical descriptions of value distributions is intellectually appealing, but I have never seen actual query engines use it (because real data is never completely described by those theoretical distributions). Maybe I am missing something

FWIW I do agree with this.

For example, take the range of values. Currently that's two different stat values, `min` and `max`, but that should probably be encapsulated in one struct/enum.

For `Distribution<ScalarValue>`, it's unlikely that you're ever gonna use anything other than `Generic`, because parquet -- or most other data sources -- really only gives us a range with inclusive or exclusive bounds. So the entire enum is mostly unused.

Then if we look at [`GenericDistribution`](https://docs.rs/datafusion/50.0.0/datafusion/logical_expr/statistics/struct.GenericDistribution.html) and its [constructor](https://docs.rs/datafusion/50.0.0/datafusion/logical_expr/statistics/enum.Distribution.html#method.new_generic), the issue is again that it requires knowledge like variance, median, and mean, which you're likely never gonna know for most data sources. In fact, if you have any filtered data source, then calculating the `median` is virtually impossible if you wanna do anything that is remotely performant. So that's another 75% of the interface gone/unusable.

So what's kinda left is the [`Interval`](https://docs.rs/datafusion/50.0.0/datafusion/logical_expr/interval_arithmetic/struct.Interval.html) type and the kinda nice API methods around it. So maybe we could use that?

I also feel that there's a slight conflict of interest, or at least two camps, here:

- **statistics for always-correct optimizers:** Some people use statistics for optimizers like join ordering. There, a wrong statistic often only results in slower execution, but never in wrong results. That is kinda reflected in a lot of the statistics calculations in the DF code base.
- **correctness:** Some plan transformers (InfluxData for example has one) rely on statistics that can actually make hard promises, i.e. "all values are FOR SURE in this range". In that case, you really wanna be picky about what the stats do.
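To make the two points above a bit more concrete, here is a rough sketch of what "one type for the range" plus "hint vs. hard promise" could look like. This is NOT DataFusion API: `ValueRange`, `Certainty`, `RangeStat`, and `proves_absent` are hypothetical names made up for illustration, and `i64` stands in for `ScalarValue` to keep the example self-contained.

```rust
use std::ops::Bound;

/// Hypothetical: a single value replacing the separate `min` / `max` stats.
/// Each end can be inclusive, exclusive, or unknown (unbounded).
#[derive(Debug, Clone, PartialEq)]
struct ValueRange {
    lower: Bound<i64>,
    upper: Bound<i64>,
}

/// Hypothetical: is the statistic a hard promise or only a hint?
#[derive(Debug, Clone, Copy, PartialEq)]
enum Certainty {
    /// "All values are FOR SURE in this range."
    Exact,
    /// Good enough for cost-based decisions such as join ordering.
    Estimate,
}

/// Hypothetical: a range statistic tagged with how much it can be trusted.
#[derive(Debug, Clone, PartialEq)]
struct RangeStat {
    range: ValueRange,
    certainty: Certainty,
}

impl ValueRange {
    /// Returns true if `v` lies within the range, honoring bound kinds.
    fn contains(&self, v: i64) -> bool {
        let above_lower = match self.lower {
            Bound::Included(lo) => v >= lo,
            Bound::Excluded(lo) => v > lo,
            Bound::Unbounded => true,
        };
        let below_upper = match self.upper {
            Bound::Included(hi) => v <= hi,
            Bound::Excluded(hi) => v < hi,
            Bound::Unbounded => true,
        };
        above_lower && below_upper
    }
}

impl RangeStat {
    /// A correctness-sensitive transformation may only conclude that `v`
    /// cannot occur when the range is a hard promise; an estimate proves nothing.
    fn proves_absent(&self, v: i64) -> bool {
        self.certainty == Certainty::Exact && !self.range.contains(v)
    }
}
```

With something like this, a cost-based optimizer can read `range` and ignore `certainty`, while a correctness-sensitive plan transformer only acts when `certainty` is `Exact`.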
