alamb commented on issue #19487: URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3693960061
> My understanding of the prior art on this is that we at one point added [PhysicalExpr::evaluate_bounds](https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_bounds) (which is exactly what you are proposing in propagate_range_stats) and Distribution but these never made it to be widely used. I do not know exactly why this is in general, but I think in the case of Parquet row group / page stats evaluation it was mainly a performance concern:

Yes, this is my recollection too. Specifically, imagine trying to prune 1000s of files: with [PhysicalExpr::evaluate_bounds](https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_bounds) you have to call it 1000s of times, once per file, which will be really slow.

> the current approach builds a modified expression tree and a RecordBatch then evaluates it, which in theory can be vectorized, etc.

To be clear, this is how PruningPredicate works, and it is the key to making the evaluation fast: it reuses all the optimized expression evaluation machinery.

> IIRC, the paper also mentioned some balancing about compile time for pruning. Would we also have some basic heuristic approach to do the tradeoff?

I think this is related to the performance concern above: if we had a vectorized evaluator, we might not need to add such a heuristic.
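To make the performance point concrete, here is a minimal sketch in plain Rust (standard library only, not the DataFusion API). It assumes a single predicate `x > threshold` and integer min/max statistics. `ContainerStats`, `may_match_one`, and `may_match_all` are hypothetical names for illustration. The first function mirrors the "one call per file" shape; the second mirrors the column-wise shape, where the predicate is rewritten once in terms of the max statistic and then evaluated over all containers in a single pass.

```rust
/// Min/max statistics for one container (for example a file or row group).
/// Hypothetical type for illustration; not a DataFusion struct.
#[derive(Debug, Clone, Copy)]
pub struct ContainerStats {
    pub min: i64,
    pub max: i64,
}

/// Per-container shape: the caller invokes this once for every container.
/// Returns true if the container might hold a row with `x > threshold`.
pub fn may_match_one(stats: ContainerStats, threshold: i64) -> bool {
    // `x > threshold` can only hold for some row if the largest value does.
    stats.max > threshold
}

/// Column-wise shape: the predicate `x > threshold` is rewritten once as
/// `x_max > threshold` and evaluated over a whole column of max statistics.
/// Entry `i` of the result says whether container `i` must be kept.
pub fn may_match_all(maxes: &[i64], threshold: i64) -> Vec<bool> {
    maxes.iter().map(|&max| max > threshold).collect()
}
```

Both shapes give the same keep/prune decisions. The difference is that the column-wise form pays the cost of analyzing the expression once and then runs a tight loop over contiguous statistics, which is the property the comment above attributes to evaluating a rewritten expression against a RecordBatch of statistics.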
