2010YOUY01 commented on issue #19487:
URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3691987383

   > This is an incredible writeup!
   > 
   > My understanding of the prior art on this is that we at one point added 
`PhysicalExpr::evaluate_bounds` (which is exactly what you are proposing in 
`propagate_range_stats`) and `Distribution` but these never made it to be 
widely used. I do not know exactly why this is in general, but I think in the 
case of Parquet row group / page stats evaluation it was mainly a performance 
concern: the current approach builds a modified expression tree and a 
RecordBatch then evaluates it, which in theory can be vectorized, etc. This has 
also been the blocker to adding tri-state logic in Parquet at least: because we 
use expressions like `BinaryExpr` the output is a `BooleanArray`, we can't 
distinguish between `KeepAll` and `Unknown`.
   > 
   > Some thoughts based on this:
   > 
   > * Could we add `Distribution` to `ColumnStats`? It could be useful to 
track alongside.
   > * Should this be done at the via RecordBatches instead of custom Rust 
structs? Could we have something like `PhysicalExpr::prunable() -> 
Option<PrunablePhysicalExpr>` Where 
PrunablePhysicalExpr::evaluate(RecordBatch)`and the`RecordBatch`has 
a`Union(IntermediateStatsArray, IntermediateResultArray)`or something. I feel 
that working with`RecordBatch`es is a worse / more annoying API but if it has a 
significant performance impact maybe it's what we need to do.
   > 
   > Overall though I 100% stand behind this effort 🚀
   
   - Regarding the existing stat propagation, my guess is they're intended to 
be implemented for cardinality estimation, and finally benefit join ordering (I 
don't have much historical context for this part, could someone confirm? 🤔 ), 
so they have fundamentally different requirements from the stats used for 
pruning (in this Proposal) -- For cardinality estimation it have to track 
complex distribution info, and the input don't have to be vectorized (for CE 
only have to consider 1 column, for pruning it's possible to handle thousands 
of src containers). By far I don't think unifying those two components can make 
the implementation better.
   - Regarding making `PruningIntermediate` a part of `PhysicalExpr` and 
eliminate custom Rust structs: I think it's a excellent point, let me 
experiment with this approach, and hopefully we can get some cleaner APIs


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to