adriangb opened a new pull request, #19639:
URL: https://github.com/apache/datafusion/pull/19639

   ## Summary
   
   This PR implements cross-file tracking of filter selectivity in 
`ParquetSource`, so that filters can be adaptively reordered and filters 
that remove few rows can be demoted, as discussed in 
https://github.com/apache/datafusion/issues/3463#issuecomment-3708398274.
   
   **Key changes:**
   - Add `SelectivityTracker` to track filter effectiveness across files, 
using an `ExprKey` wrapper that compares expressions by structural equality
   - Each `ParquetOpener` queries shared stats to partition filters into row 
filters (push down) vs post-scan filters (inline application)
   - Post-scan filters are added to the projection and applied inline in 
the stream via `apply_post_scan_filters()`; the filter columns are then 
removed from the output
   - A `SelectivityUpdatingStream` wrapper updates the tracker when the 
stream completes
   - `build_row_filter_with_metrics()` returns per-filter metrics for 
selectivity tracking
   - Filters are reordered by observed effectiveness (most selective first)
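
The tracking and reordering idea in the list above can be sketched in isolation. This is a minimal illustration, not the PR's code: `ToyExpr` and `ToyTracker` are hypothetical stand-ins for a physical expression and for `SelectivityTracker`, and derived `Hash`/`Eq` plays the role that `ExprKey` plays for structural equality. Placing filters with no observations last is an assumption made for this sketch.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical toy expression; derived `Eq`/`Hash` compare by structure,
/// so two separately built but identical filters map to the same key.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum ToyExpr {
    Column(String),
    LiteralInt(i64),
    Gt(Box<ToyExpr>, Box<ToyExpr>),
}

/// Hypothetical tracker holding (rows_matched, rows_total) per filter.
/// Methods take `&self`, so one instance can be shared across file openers.
#[derive(Debug, Default)]
pub struct ToyTracker {
    stats: Mutex<HashMap<ToyExpr, (u64, u64)>>,
}

impl ToyTracker {
    /// Accumulate counts observed for `expr` while scanning one file.
    pub fn update(&self, expr: &ToyExpr, matched: u64, total: u64) {
        let mut stats = self.stats.lock().unwrap();
        let entry = stats.entry(expr.clone()).or_insert((0, 0));
        entry.0 += matched;
        entry.1 += total;
    }

    /// Fraction of rows filtered out so far, or `None` if nothing was observed.
    pub fn effectiveness(&self, expr: &ToyExpr) -> Option<f64> {
        let stats = self.stats.lock().unwrap();
        match stats.get(expr) {
            Some(&(matched, total)) if total > 0 => {
                Some(1.0 - matched as f64 / total as f64)
            }
            _ => None,
        }
    }

    /// Order filters from most to least effective.
    /// Assumption: filters without statistics are treated as 0.0 and sort last.
    pub fn reorder(&self, mut filters: Vec<ToyExpr>) -> Vec<ToyExpr> {
        filters.sort_by(|a, b| {
            let ea = self.effectiveness(a).unwrap_or(0.0);
            let eb = self.effectiveness(b).unwrap_or(0.0);
            eb.total_cmp(&ea)
        });
        filters
    }
}
```

In this sketch, each opener would call `update` once its stream finishes, and later openers would call `reorder` before building their row filters.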
   
   **Configuration:**
   - `parquet_options.filter_effectiveness_threshold` (default: 0.8)
   - Effectiveness = 1 - (rows_matched / rows_total), i.e. the fraction of 
rows filtered out
   - Filters whose effectiveness is below the threshold are demoted to 
post-scan filters
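
As a worked illustration of the formula and threshold above, here is a small sketch. `FilterStats` and `partition_by_threshold` are hypothetical names chosen for this example and are not taken from the PR; keeping filters with no observed rows as row filters is likewise an assumption, since the description does not say how that case is handled.

```rust
/// Hypothetical per-filter counters (the PR's `SelectivityStats` may differ).
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct FilterStats {
    pub rows_matched: u64,
    pub rows_total: u64,
}

impl FilterStats {
    /// Effectiveness = 1 - rows_matched / rows_total.
    /// Returns `None` when no rows have been observed yet.
    pub fn effectiveness(&self) -> Option<f64> {
        if self.rows_total == 0 {
            None
        } else {
            Some(1.0 - self.rows_matched as f64 / self.rows_total as f64)
        }
    }
}

/// Split named filters into (row filters, post-scan filters).
/// Filters with known effectiveness below `threshold` are demoted;
/// filters without statistics stay row filters (an assumption of this sketch).
pub fn partition_by_threshold<'a>(
    filters: &[(&'a str, FilterStats)],
    threshold: f64,
) -> (Vec<&'a str>, Vec<&'a str>) {
    let mut row_filters = Vec::new();
    let mut post_scan = Vec::new();
    for (name, stats) in filters {
        match stats.effectiveness() {
            Some(e) if e < threshold => post_scan.push(*name),
            _ => row_filters.push(*name),
        }
    }
    (row_filters, post_scan)
}
```

With the default threshold of 0.8, a filter that matches 10 of 100 rows has effectiveness 0.9 and stays a row filter, while one that matches 50 of 100 rows has effectiveness 0.5 and is demoted to post-scan.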
   
   **Files added:**
   - `datafusion/datasource-parquet/src/selectivity.rs` - Core tracking 
infrastructure
   
   **Files modified:**
   - `opener.rs` - Filter partitioning, post-scan application, 
`SelectivityUpdatingStream`
   - `row_filter.rs` - `FilterMetrics`, `RowFilterWithMetrics`, 
effectiveness-based reordering
   - `source.rs` - `selectivity_tracker` field and builder methods
   - `config.rs` - `filter_effectiveness_threshold` config option
   
   ## Test plan
   
   - [x] Unit tests for `ExprKey` hash/eq consistency
   - [x] Unit tests for `SelectivityStats::effectiveness()` edge cases
   - [x] Unit tests for `SelectivityTracker::partition_filters()` threshold 
logic
   - [x] Existing test suite passes
   - [ ] Integration tests for post-scan filter application
   - [ ] End-to-end tests for adaptive behavior across files
   - [ ] Performance benchmarks
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

