Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [datafusion]

via GitHub Mon, 16 Dec 2024 06:21:45 -0800


ozankabak commented on issue #8078:
URL: https://github.com/apache/datafusion/issues/8078#issuecomment-2545758160


   I like the incremental approach and the ultimate desiderata on 
functionality, but I have some concerns on the "how" part. I will share my 
thoughts below; hopefully we will arrive at a neat design that we will not need 
to change again any time soon.
   
   ## Precision
   
   There are two related, but independent concepts we need to track here:
   1. [Summary statistics](https://en.wikipedia.org/wiki/Summary_statistics); 
i.e. estimations about location and dispersion (spread).
   2. Definitive mathematical facts, such as the 
[range](https://en.wikipedia.org/wiki/Range_of_a_function) or a guaranteed 
superset thereof. In plain English, "hard facts" that can never be violated.
   
   (1) and (2) have different use cases. We typically use (1) for optimization 
(join algorithm and/or side selection is a good example), and we end up with 
performance gains when estimations are accurate. If not, we may end up with bad 
performance, but we shouldn't get incorrect results. However, we use (2) for 
more critical decisions, like pruning data structures (e.g. trimming hash 
tables because we know a match has become impossible for a section of the 
table). A problem in (2) causes incorrect results to be generated.
   
   (1) and (2) are not mutually exclusive. Sometimes we have info on both, 
sometimes only one but not the other, sometimes we don't know anything about 
either. The type definition as stated currently doesn't accommodate this, but 
this is an easy fix.
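
To make the "both, one, or neither" point concrete, here is a minimal sketch of a type that tracks the two kinds of information independently. The names `ValueKnowledge`, `definitely_excludes` and `best_guess` are hypothetical and are not part of DataFusion; this only illustrates the shape of the idea, not a proposed API.

```rust
/// Hypothetical sketch: an estimate and a hard bound tracked side by side.
/// Either field may be present or absent independently of the other.
#[derive(Debug, Clone, PartialEq)]
pub struct ValueKnowledge {
    /// (1) Summary statistic: a best guess that is allowed to be wrong.
    pub estimate: Option<f64>,
    /// (2) Hard fact: an inclusive [lo, hi] range that is never violated.
    pub bounds: Option<(i64, i64)>,
}

impl ValueKnowledge {
    /// For correctness-critical decisions such as pruning: returns true
    /// only when the hard bounds rule the value out. Estimates are ignored.
    pub fn definitely_excludes(&self, v: i64) -> bool {
        match self.bounds {
            Some((lo, hi)) => v < lo || v > hi,
            None => false, // no hard fact, so nothing can be ruled out
        }
    }

    /// For cost-based choices only: uses the estimate if there is one,
    /// otherwise falls back to the midpoint of the hard bounds.
    pub fn best_guess(&self) -> Option<f64> {
        self.estimate
            .or_else(|| self.bounds.map(|(lo, hi)| (lo as f64 + hi as f64) / 2.0))
    }
}
```

The design choice here is that only `bounds` ever feeds a decision that affects correctness, so a bad `estimate` can cost performance but cannot change query results.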
   
   ## Column statistics vs Expression evaluation
   
   Evaluating an expression for bounds is a "low-level" concept - a slight 
generalization of normal expression evaluation where one uses interval 
arithmetic (or another range computation technique) instead of ordinary 
arithmetic. There are many use cases where this is exactly what is required, and other 
statistical concepts such as distinct count and/or null count are not even 
applicable. We actually have use cases such as this at Synnada. One can discuss 
what the range of an expression is without any dataset at hand -- the ranges 
of the symbols are enough.
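
As an illustration of range evaluation with no dataset involved, here is a toy interval-arithmetic evaluator. `Range`, `Expr` and `bounds_of` are hypothetical names for this sketch and are not DataFusion's `Interval` type or its `evaluate_bounds` signature. It also ignores floating-point rounding direction and unbounded endpoints, which a real implementation would need to handle.

```rust
use std::collections::HashMap;

/// Hypothetical closed interval [lo, hi].
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Range {
    pub lo: f64,
    pub hi: f64,
}

/// Hypothetical toy expression tree over named symbols.
pub enum Expr {
    Symbol(&'static str),
    Literal(f64),
    Add(Box<Expr>, Box<Expr>),
    Sub(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Evaluates `expr` with interval arithmetic instead of ordinary arithmetic.
/// The only input is the range of each symbol; returns None if a symbol
/// has no known range.
pub fn bounds_of(expr: &Expr, env: &HashMap<&str, Range>) -> Option<Range> {
    match expr {
        Expr::Symbol(name) => env.get(name).copied(),
        Expr::Literal(v) => Some(Range { lo: *v, hi: *v }),
        Expr::Add(a, b) => {
            let (a, b) = (bounds_of(a, env)?, bounds_of(b, env)?);
            Some(Range { lo: a.lo + b.lo, hi: a.hi + b.hi })
        }
        Expr::Sub(a, b) => {
            let (a, b) = (bounds_of(a, env)?, bounds_of(b, env)?);
            Some(Range { lo: a.lo - b.hi, hi: a.hi - b.lo })
        }
        Expr::Mul(a, b) => {
            let (a, b) = (bounds_of(a, env)?, bounds_of(b, env)?);
            // The extremes of a product lie among the four endpoint products.
            let p = [a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi];
            Some(Range {
                lo: p.iter().copied().fold(f64::INFINITY, f64::min),
                hi: p.iter().copied().fold(f64::NEG_INFINITY, f64::max),
            })
        }
    }
}
```

Note that nothing in this sketch mentions row counts, nulls or distinct values; those concepts simply do not arise at this level.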
   
   On the other hand, computing column statistics is a "higher-level" task 
where computing the range is just one component of what is sought after. All in 
all, I don't currently think merging these things is a good idea. I see that a 
mechanism for "stats in -> stats out" kind of computations seems to be 
necessary, but IMO any such mechanism should use `evaluate_bounds` as a 
subroutine.
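
A rough sketch of that layering, under stated assumptions: `toy_evaluate_bounds` is a simplified stand-in for the low-level range computation (it is not the real `evaluate_bounds` signature), and `ColumnStats`, `ColExpr` and `project_stats` are hypothetical names for the higher-level "stats in -> stats out" step over a single input column.

```rust
/// Hypothetical inclusive integer range.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Bounds {
    pub lo: i64,
    pub hi: i64,
}

/// Hypothetical column-level statistics; the range is only one component.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnStats {
    pub bounds: Option<Bounds>,
    pub null_count: Option<usize>,
    pub distinct_count: Option<usize>,
}

/// Hypothetical toy expression over one input column.
pub enum ColExpr {
    Column,
    Literal(i64),
    Add(Box<ColExpr>, Box<ColExpr>),
    Negate(Box<ColExpr>),
}

/// Low-level step: a pure range computation with no dataset notions.
/// Saturating arithmetic is used only to avoid overflow panics in this toy.
pub fn toy_evaluate_bounds(expr: &ColExpr, input: Bounds) -> Bounds {
    match expr {
        ColExpr::Column => input,
        ColExpr::Literal(v) => Bounds { lo: *v, hi: *v },
        ColExpr::Add(a, b) => {
            let (a, b) = (toy_evaluate_bounds(a, input), toy_evaluate_bounds(b, input));
            Bounds { lo: a.lo.saturating_add(b.lo), hi: a.hi.saturating_add(b.hi) }
        }
        ColExpr::Negate(a) => {
            let a = toy_evaluate_bounds(a, input);
            Bounds { lo: a.hi.saturating_neg(), hi: a.lo.saturating_neg() }
        }
    }
}

/// Higher-level step: derives output column statistics from input ones,
/// delegating the range component to the low-level routine.
pub fn project_stats(expr: &ColExpr, input: &ColumnStats) -> ColumnStats {
    ColumnStats {
        bounds: input.bounds.map(|b| toy_evaluate_bounds(expr, b)),
        // Assumption: the expression references the column and every
        // operator propagates nulls, so the null count carries over.
        null_count: input.null_count,
        // A function of one column cannot yield more distinct values than
        // its input, so this is an upper bound rather than an exact count.
        distinct_count: input.distinct_count,
    }
}
```

The range routine stays usable on its own, while everything dataset-specific lives only in the higher-level function.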
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

