adriangb opened a new pull request, #21157:
URL: https://github.com/apache/datafusion/pull/21157

   ## Summary
   
   - Introduces the `StatisticsSource` trait: an expression-based async statistics API that accepts `&[Expr]` and returns `Vec<Option<ArrayRef>>`
   - Adds `ResolvedStatistics`: a `HashMap<Expr, ArrayRef>` cache that separates async data resolution from sync predicate evaluation
   - Adds `PruningPredicate::evaluate()`: sync evaluation against the pre-resolved stats cache
   - A blanket impl bridges all existing `PruningStatistics` implementations automatically
   - Refactors `prune()` to delegate through `resolve_all_sync()` + `evaluate()`, validating the two-phase pattern end-to-end
   
   This design enables async statistics sources (external metastores, runtime sampling) while keeping evaluation synchronous for `Stream::poll_next()` contexts such as `EarlyStoppingStream`. It also lays the groundwork for struct field pruning (#21003) by accepting arbitrary `Expr` values (e.g., `min(get_field(struct_col, 'field'))`).
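
The two-phase pattern described above can be illustrated with a minimal, std-only sketch. It is not the code in this PR: `StatExpr`, `StatArray`, `StatsSource`, `ResolvedStats`, `InMemoryStats`, and `evaluate_gt` are hypothetical stand-ins (for `Expr`, `ArrayRef`, `StatisticsSource`, `ResolvedStatistics`, and `PruningPredicate::evaluate()` respectively), the resolve step is written synchronously for brevity, and only a single `column > threshold` predicate is handled.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for DataFusion's `Expr`: just the statistics
/// expressions this sketch needs.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum StatExpr {
    Min(String),
    Max(String),
}

/// Hypothetical stand-in for `ArrayRef`: one optional value per container
/// (for example, per row group). `None` means the statistic is unknown.
type StatArray = Vec<Option<i64>>;

/// Simplified analogue of the `StatisticsSource` idea. The PR describes an
/// async API; this sketch is synchronous to stay dependency-free.
trait StatsSource {
    fn num_containers(&self) -> usize;
    /// Returns one entry per requested expression, `None` if unavailable.
    fn resolve(&self, exprs: &[StatExpr]) -> Vec<Option<StatArray>>;
}

/// Simplified analogue of `ResolvedStatistics`: a cache keyed by expression.
struct ResolvedStats {
    num_containers: usize,
    cache: HashMap<StatExpr, StatArray>,
}

impl ResolvedStats {
    /// Phase 1: fetch everything up front; unavailable entries are skipped.
    fn resolve_all(source: &dyn StatsSource, exprs: &[StatExpr]) -> Self {
        let arrays = source.resolve(exprs);
        let cache = exprs
            .iter()
            .cloned()
            .zip(arrays)
            .filter_map(|(expr, array)| array.map(|a| (expr, a)))
            .collect();
        Self { num_containers: source.num_containers(), cache }
    }
}

/// Phase 2: evaluate `column > threshold` using only the cache.
/// `true` means "keep the container". A missing cache entry or an unknown
/// max keeps the container, which is the conservative choice.
fn evaluate_gt(stats: &ResolvedStats, column: &str, threshold: i64) -> Vec<bool> {
    match stats.cache.get(&StatExpr::Max(column.to_string())) {
        Some(maxes) => maxes
            .iter()
            .map(|max| max.map_or(true, |m| m > threshold))
            .collect(),
        None => vec![true; stats.num_containers],
    }
}

/// Toy source that only knows per-container maxima.
struct InMemoryStats {
    num_containers: usize,
    maxes: HashMap<String, StatArray>,
}

impl StatsSource for InMemoryStats {
    fn num_containers(&self) -> usize {
        self.num_containers
    }

    fn resolve(&self, exprs: &[StatExpr]) -> Vec<Option<StatArray>> {
        exprs
            .iter()
            .map(|expr| match expr {
                StatExpr::Max(col) => self.maxes.get(col).cloned(),
                StatExpr::Min(_) => None, // this toy source has no minima
            })
            .collect()
    }
}
```

Because phase 2 touches nothing but the in-memory cache, it can run inside a non-async context, which is the property the description relies on for `Stream::poll_next()`.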
   
   ## Test plan
   
   - [x] All 82 existing pruning tests pass unchanged
   - [x] 16 new tests covering: resolve helpers (min/max/count/InList/NOT IN), `ResolvedStatistics` cache, evaluate-matches-prune equivalence, missing cache entries → conservative keep
   - [x] Zero clippy warnings
   - [x] `datafusion-datasource-parquet` compiles unchanged
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

