2010YOUY01 commented on issue #19487: URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3695386411
> * Should this be done via `RecordBatch`es instead of custom Rust structs? Could we have something like `PhysicalExpr::prunable() -> Option<PrunablePhysicalExpr>` where `PrunablePhysicalExpr::evaluate(RecordBatch)` and the `RecordBatch` has a `Union(IntermediateStatsArray, IntermediateResultArray)` or something. I feel that working with `RecordBatch`es is a worse / more annoying API but if it has a significant performance impact maybe it's what we need to do. Then we could evaluate all columns at once for all containers?

I think the ideal design is somewhere in the middle: a custom `PruningIntermediate` struct as the intermediate representation for statistics propagation, which uses Arrow `Array` as the physical representation where possible.

The reason is that for simple stats like min/max, using `Array` means a) evaluation is fast, and b) stats propagation can be implemented easily with existing Arrow kernels. For complex statistics we want more flexibility, and `RecordBatch`es as the IR can be awkward for such cases. For example, for set statistics it's better to store `Vec<HashSet<ScalarValue>>` directly, and this also extends to statistics types we haven't considered yet, like storing a bloom filter directly as a column statistic.

The initial POC did use Arrow `Array` for min/max, but for `PruningResult` (`KeepAll`, `SkipAll`, `Unknown`) and null stats it used custom Rust enums. I think we can use a `BooleanArray` for `PruningResult` (true -> `KeepAll`, false -> `SkipAll`, null -> `Unknown`) to get the above-mentioned benefits; see the sketch at the end of this comment.

> > IIRC, the paper also mentioned some balancing about compile time for pruning. Would we also have some basic heuristic approach to do the tradeoff?
>
> I think this is related to the performance concern above -- if we had a vectorized evaluator we may not have to add such a heuristic

I believe that for range statistics like min/max, the vectorized evaluator should be fast enough. For set statistics it can get slow, so the pruning framework should be able to:

1. Short-circuit effectively: if we can derive a pruning decision without `SetStats`, set statistics don't have to be evaluated (a sketch follows at the end of this comment).
2. Refine at runtime: cut away useless sub-expressions based on runtime feedback.

I don't plan to include such optimizations in the initial PR, but I want to make sure they remain extensible as follow-ups.

----

Update: I have included two more sections in the issue for clarification:

- `PruningIntermediate` is vectorized: the IR's physical design for evaluating thousands of containers efficiently
- Existing `evaluate_bounds()` API: the difference between this proposal and the existing stats propagation API

And I'm working on a refined version to:

* For `PruningIntermediate`, use Arrow `Array` as the physical representation when possible, and eliminate several custom Rust structs.
* Make pruning evaluation easier to short-circuit (if we can derive an effective pruning decision with only range statistics, skip evaluating set statistics).
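To make the hybrid IR concrete, here is a minimal sketch of what `PruningIntermediate` could look like. The variant names and layout are hypothetical illustrations of the idea above, not DataFusion's actual API:

```rust
use std::collections::HashSet;

use arrow::array::{ArrayRef, BooleanArray};
use datafusion_common::ScalarValue;

/// Hypothetical IR: each variant covers all containers (files / row groups)
/// at once, so evaluation over thousands of containers stays vectorized.
enum PruningIntermediate {
    /// Range statistics: element i holds the min/max of container i.
    /// Arrow arrays let stats propagation reuse existing compute kernels.
    Range { min: ArrayRef, max: ArrayRef },
    /// Set statistics: element i is the value set of container i.
    /// A plain Rust container is more natural than a RecordBatch here.
    Sets(Vec<HashSet<ScalarValue>>),
    /// A resolved pruning decision as a tri-state BooleanArray:
    /// true -> KeepAll, false -> SkipAll, null -> Unknown.
    Result(BooleanArray),
}

fn main() {
    // Example: containers 0 and 2 are decided; container 1 is still Unknown.
    let _decided = PruningIntermediate::Result(BooleanArray::from(vec![
        Some(true),
        None,
        Some(false),
    ]));
}
```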
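And a minimal sketch of the short-circuit idea from point 1 above, assuming range statistics have already been folded into a tri-state `BooleanArray`; `prune_with_short_circuit` and `eval_sets` are hypothetical names, not part of the proposal's API:

```rust
use arrow::array::{Array, BooleanArray};

/// Hypothetical short-circuit: consult (slower) set statistics only for
/// containers the range check left undecided (null = Unknown).
fn prune_with_short_circuit(
    range_result: &BooleanArray,
    eval_sets: impl Fn(usize) -> Option<bool>,
) -> BooleanArray {
    (0..range_result.len())
        .map(|i| {
            if range_result.is_valid(i) {
                // Range statistics already decided KeepAll / SkipAll.
                Some(range_result.value(i))
            } else {
                // Unknown: fall back to set statistics for this container only.
                eval_sets(i)
            }
        })
        .collect()
}

fn main() {
    let range = BooleanArray::from(vec![Some(true), None, Some(false)]);
    // Pretend set statistics can prune every still-unknown container.
    let merged = prune_with_short_circuit(&range, |_i| Some(false));
    let expected = BooleanArray::from(vec![Some(true), Some(false), Some(false)]);
    assert_eq!(merged, expected);
}
```

The null-as-Unknown convention is what makes layering cheap: inexpensive statistics run first over all containers, and more expensive ones only fill in the remaining nulls.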
