ozankabak commented on issue #8078: URL: https://github.com/apache/datafusion/issues/8078#issuecomment-2545758160
I like the incremental approach and the ultimate desiderata on functionality, but I have some concerns on the "how" part. I will share my thoughts below; hopefully we will get to a neat design that we will not need to change/mess with again relatively soon.

## Precision

There are two related, but independent concepts we need to track here:

1. [Summary statistics](https://en.wikipedia.org/wiki/Summary_statistics); i.e. estimations about location and dispersion (spread).
2. Definitive mathematical facts, such as the [range](https://en.wikipedia.org/wiki/Range_of_a_function) or a guaranteed superset thereof. In plain English, "hard facts" that can never be violated.

(1) and (2) have different use cases. We typically use (1) for optimization (join algorithm and/or side selection is a good example), and we end up with performance gains when estimations are accurate. If not, we may end up with bad performance, but we shouldn't get incorrect results. However, we use (2) for more critical decisions, like pruning data structures (e.g. trimming hash tables because we know a match has become impossible for a section of the table). A problem in (2) causes incorrect results to be generated.

(1) and (2) are not mutually exclusive. Sometimes we have info on both, sometimes only one but not the other, sometimes we don't know anything about either. The type definition as stated currently doesn't accommodate this, but this is an easy fix.

## Column statistics vs Expression evaluation

Evaluating an expression for bounds is a "low-level" concept - a slight generalization of normal expression evaluation where one uses interval arithmetic (or another range computation technique) instead of ordinary arithmetic. There are many use cases where this is what is required, and other statistical concepts such as distinct count and/or null count are not even applicable. We actually have use cases such as this at Synnada.
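
To make the "generalized evaluation" idea concrete, here is a minimal sketch of bounds evaluation via interval arithmetic. Everything in it is an illustrative assumption: the `Interval` struct, the toy `Expr` tree, and this free-standing `evaluate_bounds` function are hypothetical stand-ins, not DataFusion's actual types or signatures, and integer overflow handling is omitted for brevity.

```rust
/// A closed integer interval [lo, hi]: a guaranteed superset of the values
/// an expression can take. Hypothetical stand-in, not DataFusion's `Interval`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Interval {
    pub lo: i64,
    pub hi: i64,
}

impl Interval {
    pub fn new(lo: i64, hi: i64) -> Self {
        assert!(lo <= hi, "lower bound must not exceed upper bound");
        Interval { lo, hi }
    }

    /// [a, b] + [c, d] = [a + c, b + d]
    pub fn plus(self, rhs: Interval) -> Interval {
        Interval::new(self.lo + rhs.lo, self.hi + rhs.hi)
    }

    /// [a, b] * [c, d] = [min of corner products, max of corner products]
    pub fn times(self, rhs: Interval) -> Interval {
        let corners = [
            self.lo * rhs.lo,
            self.lo * rhs.hi,
            self.hi * rhs.lo,
            self.hi * rhs.hi,
        ];
        let lo = *corners.iter().min().unwrap();
        let hi = *corners.iter().max().unwrap();
        Interval::new(lo, hi)
    }
}

/// A toy expression tree (hypothetical; only what the sketch needs).
pub enum Expr {
    /// Reference to the i-th input symbol.
    Column(usize),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

/// Same recursion as ordinary evaluation, but over intervals instead of
/// scalars. Only the range of each symbol is needed; no dataset is involved.
pub fn evaluate_bounds(expr: &Expr, inputs: &[Interval]) -> Interval {
    match expr {
        Expr::Column(i) => inputs[*i],
        Expr::Literal(v) => Interval::new(*v, *v),
        Expr::Add(l, r) => evaluate_bounds(l, inputs).plus(evaluate_bounds(r, inputs)),
        Expr::Mul(l, r) => evaluate_bounds(l, inputs).times(evaluate_bounds(r, inputs)),
    }
}
```

For example, with `a` in [1, 3] and `b` in [-5, 10], evaluating `a * 2 + b` under these rules yields [-3, 16]. Note that nothing about distinct counts or null counts enters the computation.
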
One can discuss what the range of an expression is without any dataset at hand -- the range of each symbol is enough. On the other hand, computing column statistics is a "higher-level" task where computing the range is just one component of what is sought after.

All in all, I don't currently think merging these things is a good idea. I see that a mechanism for "stats in -> stats out" kind of computations seems to be necessary, but IMO any such mechanism should use `evaluate_bounds` as a subroutine.
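
A minimal sketch of how the two points above could fit together, under stated assumptions: the estimate and the hard bounds are tracked as independent optional fields, and a "stats in -> stats out" step delegates the hard-fact part to a bounds subroutine. The names `Bounds`, `ValueInfo`, `evaluate_bounds_add`, `add_stats` and `can_prune_gt` are hypothetical and are not part of DataFusion's API; overflow handling is omitted.

```rust
/// Closed integer interval: a guaranteed superset of the possible values.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Bounds {
    pub lo: i64,
    pub hi: i64,
}

/// Hypothetical per-value information. The estimate and the hard bounds are
/// independent: either, both, or neither may be known.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ValueInfo {
    /// Soft estimate of location; being wrong costs performance only.
    pub estimated_mean: Option<f64>,
    /// Hard fact; being wrong would produce incorrect results.
    pub bounds: Option<Bounds>,
}

/// Low-level bounds subroutine for `x + y` (plain interval addition).
pub fn evaluate_bounds_add(a: Bounds, b: Bounds) -> Bounds {
    Bounds { lo: a.lo + b.lo, hi: a.hi + b.hi }
}

/// Higher-level "stats in -> stats out" step for `x + y`. The hard-fact part
/// is delegated to the bounds subroutine; the estimate is propagated
/// separately (the mean of a sum is the sum of the means).
pub fn add_stats(x: ValueInfo, y: ValueInfo) -> ValueInfo {
    ValueInfo {
        estimated_mean: x.estimated_mean.zip(y.estimated_mean).map(|(a, b)| a + b),
        bounds: x.bounds.zip(y.bounds).map(|(a, b)| evaluate_bounds_add(a, b)),
    }
}

/// Pruning decision for the predicate `value > threshold`. It consults hard
/// bounds only and never the estimate, so a bad estimate cannot cause
/// rows to be dropped incorrectly.
pub fn can_prune_gt(info: &ValueInfo, threshold: i64) -> bool {
    matches!(info.bounds, Some(b) if b.hi <= threshold)
}
```

In this arrangement an optimizer could read `estimated_mean` for decisions such as join side selection, while pruning reads `bounds` alone and refuses to prune when no bounds are known.
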
