2010YOUY01 commented on PR #21122:
URL: https://github.com/apache/datafusion/pull/21122#issuecomment-4572728825

   @asolimando The implementation plan looks great!
   
   Regarding unified API vs. separate APIs:
   
   1. I think the unified API approach is a better match for the underlying 
theory. Conceptually, what gets propagated through the expression tree is a 
**distribution summary**; different statistics such as NDV/null ratio are 
projections from that same distribution.
   
      In a naive implementation for numeric expressions, the propagated fact 
seems to be something like `UniformDistribution(min, max, null_ratio)`, and we 
use that distribution to derive selectivity for predicates.
   
      With separate APIs, the mental model may drift toward treating every 
statistic type as independent. That could make the design harder to extend 
later, especially if we want to handle skewed or correlated distributions.
   
      I now feel the StatisticsV2 design makes a lot of sense from this 
perspective. Maybe we can revisit that approach and fix the engineering issues 
from the previous attempt:
   
      * https://github.com/apache/datafusion/pull/22071
   
   2. The requirement for lazily computing only a subset of statistics also 
makes sense to me. We could try to encode this into a request-based API:
   
      ```
      estimate_stats(expr, context, request) -> response
      ```
   
      where the request says something like:
   
      ```
      I want: null_ratio, ndv, min/max, selectivity, ...
      with: cheap_stats_only / build_histogram / ...
      ```
   
      This probably still needs more thought to design cleanly, though.
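
   To make points 1 and 2 a bit more concrete, here is a rough sketch of how 
they might fit together. All names here (`UniformSummary`, `StatsRequest`, 
`StatsResponse`, `selectivity_lt`) are hypothetical and not existing DataFusion 
APIs. The `expr` and `context` arguments are collapsed into an already-derived 
summary, and only a `col < bound` predicate is modeled:

   ```rust
   /// Hypothetical distribution summary: non-null values are assumed to be
   /// uniform on [min, max], and `null_ratio` is the fraction of NULL rows.
   #[derive(Debug, Clone, Copy)]
   struct UniformSummary {
       min: f64,
       max: f64,
       null_ratio: f64,
   }

   impl UniformSummary {
       /// Estimated fraction of all rows satisfying `col < bound`.
       /// NULL rows never satisfy the predicate, so the result is scaled
       /// by the non-null share.
       fn selectivity_lt(&self, bound: f64) -> f64 {
           let width = self.max - self.min;
           let non_null = 1.0 - self.null_ratio;
           if width <= 0.0 {
               // Degenerate range: every non-null value equals `min`.
               return if self.min < bound { non_null } else { 0.0 };
           }
           let frac = ((bound - self.min) / width).clamp(0.0, 1.0);
           frac * non_null
       }
   }

   /// Hypothetical request: the caller lists only the statistics it needs.
   #[derive(Debug, Default, Clone, Copy)]
   struct StatsRequest {
       null_ratio: bool,
       min_max: bool,
       /// If set, estimate selectivity of `col < bound` for this bound.
       selectivity_lt: Option<f64>,
   }

   /// Hypothetical response: fields that were not requested stay `None`.
   #[derive(Debug, Default, PartialEq)]
   struct StatsResponse {
       null_ratio: Option<f64>,
       min_max: Option<(f64, f64)>,
       selectivity: Option<f64>,
   }

   /// Every statistic is a projection of the same summary, and only the
   /// requested projections are computed.
   fn estimate_stats(summary: &UniformSummary, request: &StatsRequest) -> StatsResponse {
       StatsResponse {
           null_ratio: request.null_ratio.then_some(summary.null_ratio),
           min_max: request.min_max.then_some((summary.min, summary.max)),
           selectivity: request.selectivity_lt.map(|b| summary.selectivity_lt(b)),
       }
   }

   fn main() {
       let summary = UniformSummary { min: 0.0, max: 100.0, null_ratio: 0.1 };
       let request = StatsRequest { selectivity_lt: Some(25.0), ..Default::default() };
       let response = estimate_stats(&summary, &request);
       println!("{response:?}");
   }
   ```

   A skewed or correlated distribution would presumably replace 
`UniformSummary` with a richer summary type while the request/response shape 
stays the same, which is the extensibility argument from point 1.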
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

