crepererum commented on issue #8078:
URL: https://github.com/apache/datafusion/issues/8078#issuecomment-3389753573

   > I can't help but feel the current Distribution doesn't have many practical 
benefits -- specifically the idea of having mathematical descriptions of value 
distributions is intellectually appealing, but I have never seen actual query 
engines use it (because real data is never completely described by those 
theoretical distributions). Maybe I am missing something
   
   FWIW I do agree with this. Take the range of values, for example. Currently 
that's two separate stat values, `min` and `max`, but it should probably be 
encapsulated in a single struct/enum. For `Distribution<ScalarValue>`, it's 
unlikely that you're ever gonna use anything other than `Generic`, because what 
parquet (or most other data sources) gives us is really only a range with 
inclusive or exclusive bounds. So the entire enum is mostly unused.
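   
   A minimal sketch of what "one struct for the range" could look like. 
`ValueRange` and its methods are hypothetical names for illustration only, not 
DataFusion API; the only real type used is `std::ops::Bound` from the standard 
library, and a generic `T` stands in for `ScalarValue`:

```rust
use std::ops::Bound;

/// Hypothetical replacement for separate `min`/`max` statistics:
/// a single range whose ends are inclusive, exclusive, or unknown.
#[derive(Debug, Clone, PartialEq)]
pub struct ValueRange<T> {
    pub lower: Bound<T>,
    pub upper: Bound<T>,
}

impl<T: PartialOrd> ValueRange<T> {
    pub fn new(lower: Bound<T>, upper: Bound<T>) -> Self {
        Self { lower, upper }
    }

    /// Returns true if `value` lies within the range.
    pub fn contains(&self, value: &T) -> bool {
        let above_lower = match &self.lower {
            Bound::Included(lo) => value >= lo,
            Bound::Excluded(lo) => value > lo,
            Bound::Unbounded => true,
        };
        let below_upper = match &self.upper {
            Bound::Included(hi) => value <= hi,
            Bound::Excluded(hi) => value < hi,
            Bound::Unbounded => true,
        };
        above_lower && below_upper
    }
}
```

   With this shape, a column known to hold values in `[1, 10)` would be 
`ValueRange::new(Bound::Included(1), Bound::Excluded(10))`, and a missing `min` 
or `max` is just `Bound::Unbounded` on that side.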
   
   Then if we look at 
[`GenericDistribution`](https://docs.rs/datafusion/50.0.0/datafusion/logical_expr/statistics/struct.GenericDistribution.html)
 and its 
[constructor](https://docs.rs/datafusion/50.0.0/datafusion/logical_expr/statistics/enum.Distribution.html#method.new_generic),
 the issue is again that it requires knowledge like variance, median, and mean, 
which you're likely never gonna know for most data sources. In fact, if you have 
any filtered data source, then calculating the `median` is virtually impossible 
if you wanna do anything that is remotely performant. So that's another 75% of 
the interface gone/unusable.
   
   So what's left is the 
[`Interval`](https://docs.rs/datafusion/50.0.0/datafusion/logical_expr/interval_arithmetic/struct.Interval.html)
 type and the rather nice API methods around it. So maybe we could use that?
   
   I also feel that there's a slight conflict of interest or at least two camps 
here:
   
   - **optimizers that stay correct regardless of statistics:** Some people use 
statistics for optimizers like join ordering. There, wrong statistics often only 
result in slower execution, but never in wrong results. That is reflected in a 
lot of the statistics calculations in the DF code base.
   - **correctness:** Some plan transformers (InfluxData for example has one) 
rely on statistics that can actually make hard promises, i.e. "all values 
are FOR SURE in this range". In that case, you really wanna be picky about what 
the stats do.
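   
   One way to picture the two camps in code: a rough sketch in which every 
statistic carries its trust level, so each consumer can only pick up what is 
safe for it. The `Stat` enum and both accessor names are made up for this 
example and are not existing DataFusion API:

```rust
/// Hypothetical wrapper recording how far a statistic can be trusted.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Stat<T> {
    /// Hard promise, e.g. "all values are FOR SURE in this range".
    Guaranteed(T),
    /// Best-effort estimate; being wrong only costs performance.
    Estimated(T),
    /// Nothing is known.
    Unknown,
}

impl<T> Stat<T> {
    /// For plan transformers that change results if the stat is wrong:
    /// only guaranteed values are handed out.
    pub fn for_correctness(&self) -> Option<&T> {
        match self {
            Stat::Guaranteed(v) => Some(v),
            Stat::Estimated(_) | Stat::Unknown => None,
        }
    }

    /// For cost-based decisions such as join ordering:
    /// estimates are good enough.
    pub fn for_costing(&self) -> Option<&T> {
        match self {
            Stat::Guaranteed(v) | Stat::Estimated(v) => Some(v),
            Stat::Unknown => None,
        }
    }
}
```

   The point of the split is that an estimate can never silently leak into a 
correctness-critical rewrite, while cost-based optimizers still get to use 
everything that is available.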


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

