ozankabak commented on PR #19609:
URL: https://github.com/apache/datafusion/pull/19609#issuecomment-3727520370

   > If `evaluate_statistics()` and `propagate_statistics()` are intended to 
replace `evaluate_bounds()` and `propagate_constraints()`, and the latter two 
are no longer used inside the DataFusion core, should we simply deprecate them?
   
   They are not. The latter are the simplest building blocks that calculate 
ranges, the former are designed to use them to generate richer statistical 
information. The former are not used yet in the codebase, because we 
unfortunately didn't have time to work on it over the last 6 months or so, but 
have an intention to come back and finish the job.
   
   I didn't have time yet to look at this big PR, but I looked at the issue and 
design. As a general thought I think replacing the 
`evaluate_bounds`/`propagate_constraints` duo will not be easy. The `Interval` 
library makes very careful calculations w.r.t. things like rounding (with 
floats etc.), because their results are used to take branches. Support for such 
rounding-aware calculations were not present in `arrow-rs` at the time when we 
created these APIs (and still not here AFAICT).
   
   However, we have two things:
   1. Rigor (w.r.t. containment) in bounds is a real need in some use cases 
(like join pruning).
   2. Vectorization is a real need in other use cases (like Parquet pruning).
   
   A good first step is to add `evaluate_bounds_vectorized`, and note these 
nuances in the docs. Then, if you think it is possible to somehow achieve 
containment rigor w.r.t. rounding in the vectorized version too, we can try our 
hand at that, and then implement its `propagate_constraints` counterpart. Only 
at that stage we can remove the `evaluate_bounds`/`propagate_constraints` duo. 
Then, `evaluate_statistics()` and `propagate_statistics()` can migrate to the 
new API.
   
   However, my hunch is that this won't be possible for the time being, and we 
will need to a new API that does vectorized bounds, and make it clear in the 
docs that we need the two APIs to cater to 1 and 2.
   
   I hope this context helps @2010YOUY01 and @alamb


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to