adriangb commented on issue #19487:
URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3691561747

   This is an incredible writeup!
   
   My understanding of the prior art on this is that we at one point added 
`PhysicalExpr::evaluate_bounds` (which is exactly what you are proposing in 
`propagate_range_stats`) and `Distribution` but these never made it to be 
widely used. I do not know exactly why this is in general, but I think in the 
case of Parquet row group / page stats evaluation it was mainly a performance 
concern: the current approach builds a modified expression tree and a 
RecordBatch then evaluates it, which in theory can be vectorized, etc. This has 
also been the blocker to adding tri-state logic in Parquet at least: because we 
use expressions like `BinaryExpr` the output is a `BooleanArray`, we can't 
distinguish between `KeepAll` and `Unknown`.
   
   Some thoughts based on this:
   - Could we add `Distribution` to `ColumnStats`? It could be useful to track 
alongside.
   - Should this be done at the via RecordBatches instead of custom Rust 
structs? Could we have something like `PhysicalExpr::prunable() -> 
Option<PrunablePhysicalExpr>` Where 
PrunablePhysicalExpr::evaluate(RecordBatch)` and the `RecordBatch` has a 
`Union(IntermediateStatsArray, IntermediateResultArray)` or something. I feel 
that working with `RecordBatch`es is a worse / more annoying API but if it has 
a significant performance impact maybe it's what we need to do.
   
   Overall though I 100% stand behind this effort 🚀 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to