2010YOUY01 commented on issue #19487: URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3691987383
> This is an incredible writeup! > > My understanding of the prior art on this is that we at one point added `PhysicalExpr::evaluate_bounds` (which is exactly what you are proposing in `propagate_range_stats`) and `Distribution` but these never made it to be widely used. I do not know exactly why this is in general, but I think in the case of Parquet row group / page stats evaluation it was mainly a performance concern: the current approach builds a modified expression tree and a RecordBatch then evaluates it, which in theory can be vectorized, etc. This has also been the blocker to adding tri-state logic in Parquet at least: because we use expressions like `BinaryExpr` the output is a `BooleanArray`, we can't distinguish between `KeepAll` and `Unknown`. > > Some thoughts based on this: > > * Could we add `Distribution` to `ColumnStats`? It could be useful to track alongside. > * Should this be done at the via RecordBatches instead of custom Rust structs? Could we have something like `PhysicalExpr::prunable() -> Option<PrunablePhysicalExpr>` Where PrunablePhysicalExpr::evaluate(RecordBatch)`and the`RecordBatch`has a`Union(IntermediateStatsArray, IntermediateResultArray)`or something. I feel that working with`RecordBatch`es is a worse / more annoying API but if it has a significant performance impact maybe it's what we need to do. > > Overall though I 100% stand behind this effort 🚀 - Regarding the existing stat propagation, my guess is they're intended to be implemented for cardinality estimation, and finally benefit join ordering (I don't have much historical context for this part, could someone confirm? 🤔 ), so they have fundamentally different requirements from the stats used for pruning (in this Proposal) -- For cardinality estimation it have to track complex distribution info, and the input don't have to be vectorized (for CE only have to consider 1 column, for pruning it's possible to handle thousands of src containers). By far I don't think unifying those two components can make the implementation better. - Regarding making `PruningIntermediate` a part of `PhysicalExpr` and eliminate custom Rust structs: I think it's a excellent point, let me experiment with this approach, and hopefully we can get some cleaner APIs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
