adriangb opened a new pull request, #19639: URL: https://github.com/apache/datafusion/pull/19639
## Summary

This PR implements cross-file tracking of filter selectivity in ParquetSource to adaptively reorder and demote low-selectivity filters, as discussed in https://github.com/apache/datafusion/issues/3463#issuecomment-3708398274.

**Key changes:**

- Add `SelectivityTracker` to track filter effectiveness across files using `ExprKey` wrapper for structural equality
- Each `ParquetOpener` queries shared stats to partition filters into row filters (push down) vs post-scan filters (inline application)
- Post-scan filters are added to projection, applied inline in stream via `apply_post_scan_filters()`, then filter columns are removed from output
- `SelectivityUpdatingStream` wrapper updates tracker when stream completes
- `build_row_filter_with_metrics()` returns per-filter metrics for selectivity tracking
- Filters are reordered by observed effectiveness (most selective first)

**Configuration:**

- `parquet_options.filter_effectiveness_threshold` (default: 0.8)
- Effectiveness = 1 - (rows_matched / rows_total) = fraction of rows filtered out
- Filters with effectiveness < threshold are demoted to post-scan

**Files added:**

- `datafusion/datasource-parquet/src/selectivity.rs` - Core tracking infrastructure

**Files modified:**

- `opener.rs` - Filter partitioning, post-scan application, `SelectivityUpdatingStream`
- `row_filter.rs` - `FilterMetrics`, `RowFilterWithMetrics`, effectiveness-based reordering
- `source.rs` - `selectivity_tracker` field and builder methods
- `config.rs` - `filter_effectiveness_threshold` config option

## Test plan

- [x] Unit tests for `ExprKey` hash/eq consistency
- [x] Unit tests for `SelectivityStats::effectiveness()` edge cases
- [x] Unit tests for `SelectivityTracker::partition_filters()` threshold logic
- [x] Existing test suite passes
- [ ] Integration tests for post-scan filter application
- [ ] End-to-end tests for adaptive behavior across files
- [ ] Performance benchmarks

🤖 Generated with [Claude
Code](https://claude.com/claude-code)
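
For reviewers who want a concrete picture of the threshold logic described under **Configuration**, here is a minimal, std-only sketch. It is not the code in `selectivity.rs`: the type and method names are borrowed from the summary, but the fields, the signatures, the use of plain strings as filter keys (in place of `ExprKey`), and the choice to push down filters that have no statistics yet are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Running counts for one filter, accumulated across files.
/// Field names are illustrative assumptions.
#[derive(Debug, Default, Clone, Copy)]
struct SelectivityStats {
    rows_matched: u64,
    rows_total: u64,
}

impl SelectivityStats {
    /// Effectiveness = 1 - (rows_matched / rows_total),
    /// i.e. the fraction of rows the filter removed.
    /// Returns `None` until at least one row has been observed.
    fn effectiveness(&self) -> Option<f64> {
        if self.rows_total == 0 {
            None
        } else {
            Some(1.0 - self.rows_matched as f64 / self.rows_total as f64)
        }
    }
}

/// Toy tracker keyed by a string rendering of the filter.
/// (The PR keys by `ExprKey`; strings are a stand-in here.)
#[derive(Debug, Default)]
struct SelectivityTracker {
    stats: HashMap<String, SelectivityStats>,
}

impl SelectivityTracker {
    /// Fold the counts observed for one file into the shared stats.
    fn update(&mut self, filter: &str, rows_matched: u64, rows_total: u64) {
        let entry = self.stats.entry(filter.to_string()).or_default();
        entry.rows_matched += rows_matched;
        entry.rows_total += rows_total;
    }

    /// Split `filters` into (row filters, post-scan filters).
    /// Filters whose effectiveness is below `threshold` are demoted.
    /// Assumption: filters with no stats yet are pushed down and
    /// ordered after the filters with known effectiveness.
    fn partition_filters(
        &self,
        filters: &[&str],
        threshold: f64,
    ) -> (Vec<String>, Vec<String>) {
        let mut row_filters: Vec<(String, Option<f64>)> = Vec::new();
        let mut post_scan: Vec<String> = Vec::new();
        for f in filters {
            let eff = self.stats.get(*f).and_then(|s| s.effectiveness());
            match eff {
                Some(e) if e < threshold => post_scan.push(f.to_string()),
                other => row_filters.push((f.to_string(), other)),
            }
        }
        // Most effective first; the sort is stable, so ties keep input order.
        row_filters.sort_by(|a, b| {
            let ka = a.1.unwrap_or(f64::NEG_INFINITY);
            let kb = b.1.unwrap_or(f64::NEG_INFINITY);
            kb.total_cmp(&ka)
        });
        (row_filters.into_iter().map(|(f, _)| f).collect(), post_scan)
    }
}
```

Under these assumptions, a filter that matched 90 of 100 rows has effectiveness 0.1 and would be demoted at the default threshold of 0.8, while one that matched 5 of 100 rows (effectiveness 0.95) would stay a row filter and be evaluated first.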
