2010YOUY01 commented on issue #19487: URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3695386411
> * Should this be done via `RecordBatch`es instead of custom Rust structs? Could we have something like `PhysicalExpr::prunable() -> Option<PrunablePhysicalExpr>` where `PrunablePhysicalExpr::evaluate(RecordBatch)` and the `RecordBatch` has a `Union(IntermediateStatsArray, IntermediateResultArray)` or something. I feel that working with `RecordBatch`es is a worse / more annoying API but if it has a significant performance impact maybe it's what we need to do. Then we could evaluate all columns at once for all containers?

I think the ideal design is somewhere in the middle: a custom `PruningIntermediate` struct as the intermediate representation for statistics propagation, which uses Arrow `Array` as the physical representation where possible.

The reason is that for simple stats like min/max, using `Array` means a) evaluation is fast, and b) stats propagation can be implemented easily with existing Arrow kernels. For complex statistics we want more flexibility, and `RecordBatch`es as the IR can be awkward for such cases. For example, for set statistics it's better to store `Vec<HashSet<ScalarValue>>` directly, and this also extends to statistics types we haven't considered yet, like storing a bloom filter directly as a column statistic.

The initial POC did use Arrow `Array` for min/max, but for `PruningResult` (`KeepAll`, `SkipAll`, `Unknown`) and null stats it used custom Rust enums. I think we can use a `BooleanArray` for `PruningResult` (true -> `KeepAll`, false -> `SkipAll`, null -> `Unknown`) to get the above-mentioned benefits; see the sketch at the end of this comment.

> > IIRC, the paper also mentioned some balancing about compile time for pruning. Would we also have some basic heuristic approach to do the tradeoff?
>
> I think this is related to the performance concern above -- if we had a vectorized evaluator we may not have to add such a heuristic

I believe that for range statistics like min/max, the vectorized evaluator should be fast enough. For set statistics it can get slow, so the pruning framework should be able to:

1. Short-circuit effectively: if we can derive a pruning decision without `SetStats`, set statistics don't have to be evaluated (a sketch follows at the end of this comment).
2. Refine at runtime: cut away useless sub-expressions based on runtime feedback.

I don't plan to include such optimizations in the initial PR, but I want to make sure they remain extensible as follow-ups.

----

Update: I have included two more sections in the issue for clarification:

- `PruningIntermediate` is vectorized: the IR's physical design for evaluating thousands of containers efficiently
- Existing `evaluate_bounds()` API: the difference between this proposal and the existing stats propagation API

And I'm working on a refined version to:

* For `PruningIntermediate`, use Arrow `Array` as the physical representation when possible, and eliminate several custom Rust structs.
* Make pruning evaluation easier to short-circuit (if we can derive an effective pruning decision with only range statistics, skip evaluating set statistics).
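To make the hybrid IR concrete, here is a minimal sketch of what `PruningIntermediate` could look like. The variant names and layout are hypothetical illustrations of the idea above, not DataFusion's actual API:

```rust
use std::collections::HashSet;

use arrow::array::{ArrayRef, BooleanArray};
use datafusion_common::ScalarValue;

/// Hypothetical IR: each variant covers all containers (files / row groups)
/// at once, so evaluation over thousands of containers stays vectorized.
enum PruningIntermediate {
    /// Range statistics: element i holds the min/max of container i.
    /// Arrow arrays let stats propagation reuse existing compute kernels.
    Range { min: ArrayRef, max: ArrayRef },
    /// Set statistics: element i is the value set of container i.
    /// A plain Rust container is more natural than a RecordBatch here.
    Sets(Vec<HashSet<ScalarValue>>),
    /// A resolved pruning decision as a tri-state BooleanArray:
    /// true -> KeepAll, false -> SkipAll, null -> Unknown.
    Result(BooleanArray),
}

fn main() {
    // Example: containers 0 and 2 are decided; container 1 is still Unknown.
    let _decided = PruningIntermediate::Result(BooleanArray::from(vec![
        Some(true),
        None,
        Some(false),
    ]));
}
```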
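And a minimal sketch of the short-circuit idea from point 1 above, assuming range statistics have already been folded into a tri-state `BooleanArray`; `prune_with_short_circuit` and `eval_sets` are hypothetical names, not part of the proposal's API:

```rust
use arrow::array::{Array, BooleanArray};

/// Hypothetical short-circuit: consult (slower) set statistics only for
/// containers the range check left undecided (null = Unknown).
fn prune_with_short_circuit(
    range_result: &BooleanArray,
    eval_sets: impl Fn(usize) -> Option<bool>,
) -> BooleanArray {
    (0..range_result.len())
        .map(|i| {
            if range_result.is_valid(i) {
                // Range statistics already decided KeepAll / SkipAll.
                Some(range_result.value(i))
            } else {
                // Unknown: fall back to set statistics for this container only.
                eval_sets(i)
            }
        })
        .collect()
}

fn main() {
    let range = BooleanArray::from(vec![Some(true), None, Some(false)]);
    // Pretend set statistics can prune every still-unknown container.
    let merged = prune_with_short_circuit(&range, |_i| Some(false));
    let expected = BooleanArray::from(vec![Some(true), Some(false), Some(false)]);
    assert_eq!(merged, expected);
}
```

The null-as-Unknown convention is what makes layering cheap: inexpensive statistics run first over all containers, and more expensive ones only fill in the remaining nulls.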
