2010YOUY01 commented on PR #21122:
URL: https://github.com/apache/datafusion/pull/21122#issuecomment-4572728825
@asolimando The implementation plan looks great!
Regarding unified API vs. separate APIs:
1. I think the unified API approach is a better match for the underlying
theory. Conceptually, what gets propagated through the expression tree is a
**distribution summary**; different statistics such as NDV/null ratio are
projections from that same distribution.
In a naive implementation for numeric expressions, the propagated fact
seems to be something like `UniformDistribution(min, max, null_ratio)`, and we
use that distribution to derive selectivity for predicates.
With separate APIs, the mental model may drift toward treating each statistic
type as independent. That could make the design harder to extend in the future,
especially if we want to handle skewed or correlated distributions.
I now feel the StatisticsV2 design makes a lot of sense from this
perspective. Maybe we can revisit that approach and fix the engineering issues
from the previous attempt:
* https://github.com/apache/datafusion/pull/22071
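To make point 1 concrete, here is a minimal sketch of the naive numeric case. Everything in it is hypothetical: the `UniformDistribution` struct only mirrors the shape mentioned above, and `selectivity_lt` and `shift` are illustrative names, not existing DataFusion APIs. It assumes values are uniformly distributed over `[min, max]` and that NULL rows never satisfy a comparison.

```rust
/// Hypothetical distribution summary for a numeric expression.
/// Not a DataFusion type; it only mirrors `UniformDistribution(min, max, null_ratio)`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct UniformDistribution {
    pub min: f64,
    pub max: f64,
    /// Fraction of rows that are NULL, in [0, 1].
    pub null_ratio: f64,
}

impl UniformDistribution {
    /// Estimated fraction of all rows for which `expr < c` is true.
    /// NULL rows never satisfy the comparison, so the result is
    /// scaled by the non-null fraction.
    pub fn selectivity_lt(&self, c: f64) -> f64 {
        let non_null = 1.0 - self.null_ratio;
        if c <= self.min {
            return 0.0;
        }
        if c >= self.max {
            return non_null;
        }
        // Here min < c < max, so the denominator is strictly positive.
        non_null * (c - self.min) / (self.max - self.min)
    }

    /// Summary of `expr + k`: the whole distribution shifts, and the
    /// null ratio is unchanged. This is the "propagation" step.
    pub fn shift(&self, k: f64) -> Self {
        Self {
            min: self.min + k,
            max: self.max + k,
            null_ratio: self.null_ratio,
        }
    }
}
```

In this sketch min/max and null ratio are fields of one summary and selectivity is derived from it, so a predicate such as `a + 10 < 35` is estimated by propagating the summary through `a + 10` first and only then projecting out a selectivity.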
2. The requirement for lazily computing only a subset of statistics also
makes sense to me. We could try to encode this into a request-based API:
```
estimate_stats(expr, context, request) -> response
```
where the request says something like:
```
I want: null_ratio, ndv, min/max, selectivity, ...
with: cheap_stats_only / build_histogram / ...
```
This probably still needs more thought to design cleanly, though.
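As a starting point for discussion, a rough sketch of what such a request/response shape could look like. All names here (`StatsRequest`, `StatsResponse`, `Effort`, `ColumnSummary`, `Context`) are made up for illustration and are not existing DataFusion types. To keep it short, the expression is reduced to a column name, and `Effort::AllowScan` stands in for costlier options such as `build_histogram`.

```rust
use std::collections::{HashMap, HashSet};

/// How much work the estimator may do (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Effort {
    CheapStatsOnly,
    /// Stand-in for costlier options such as building a histogram.
    AllowScan,
}

/// Which statistics the caller wants, and at what cost (hypothetical).
#[derive(Debug, Clone, Copy)]
pub struct StatsRequest {
    pub null_ratio: bool,
    pub min_max: bool,
    pub ndv: bool,
    pub effort: Effort,
}

/// Only requested (and affordable) statistics are filled in.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct StatsResponse {
    pub null_ratio: Option<f64>,
    pub min_max: Option<(i64, i64)>,
    pub ndv: Option<usize>,
}

/// Toy per-column input: cheap precomputed facts plus a value sample.
pub struct ColumnSummary {
    pub min: i64,
    pub max: i64,
    pub null_ratio: f64,
    pub sample: Vec<i64>,
}

pub type Context = HashMap<String, ColumnSummary>;

/// Simplified `estimate_stats(expr, context, request) -> response`,
/// where `expr` is just a column name. Returns `None` for unknown columns.
pub fn estimate_stats(
    column: &str,
    context: &Context,
    request: StatsRequest,
) -> Option<StatsResponse> {
    let col = context.get(column)?;
    let mut response = StatsResponse::default();
    if request.null_ratio {
        response.null_ratio = Some(col.null_ratio);
    }
    if request.min_max {
        response.min_max = Some((col.min, col.max));
    }
    // NDV needs a pass over the sample, so it is skipped on a cheap budget.
    if request.ndv && request.effort == Effort::AllowScan {
        let distinct: HashSet<i64> = col.sample.iter().copied().collect();
        response.ndv = Some(distinct.len());
    }
    Some(response)
}
```

Using `Option` fields in the response lets a caller distinguish "not requested or too expensive" from an actual estimate, which seems to be one way to express the lazy-subset requirement.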
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]