Dandandan opened a new pull request, #20686: URL: https://github.com/apache/datafusion/pull/20686
## Which issue does this PR close?

- Closes #.

## Rationale for this change

Dynamic filter pushdown is a powerful optimization that prunes probe-side files/row groups based on the actual values in the build side. However, the current implementation only provides bounds after the entire build side is consumed. For large build sides, this delays pruning decisions.

This PR enables early, approximate dynamic filtering by extracting min/max bounds from the build side's file-level statistics (e.g., Parquet file metadata) before the build side is fully consumed. These bounds are typically wider than the actual data bounds but are available immediately, allowing the probe side to start pruning files/row groups earlier. The filter is later refined with exact bounds once the build side is fully consumed.

## What changes are included in this PR?

1. **New function `compute_bounds_from_statistics`**: extracts min/max bounds from execution plan statistics for simple column reference expressions. Only `Precision::Exact` statistics are used, to avoid over-pruning with inexact bounds.
2. **New method `update_from_build_side_statistics`**: updates the dynamic filter with initial bounds derived from build-side file statistics, providing an early, approximate filter before the build side is fully consumed.
3. **Enhanced `create_bounds_predicate`**: added null-check guards that skip column bounds with null min/max values, so partial bounds (some columns with stats, others without) are handled correctly.
4. **Integration in `HashJoinExec`**: calls `update_from_build_side_statistics` during build accumulator initialization to provide early bounds from the build side's partition statistics.
5. **Comprehensive test coverage**: added unit tests for `compute_bounds_from_statistics` covering single/multi-column cases, absent/null statistics, partial bounds, and inexact statistics.
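The "only `Precision::Exact`" rule above is the safety-critical part: a bound derived from inexact statistics could be narrower than the real data and prune files that actually contain matches. A minimal, self-contained sketch of that idea (the `Precision` and `ColumnStatistics` types here are simplified stand-ins, not DataFusion's actual API):

```rust
// Simplified stand-in for DataFusion's statistics precision marker.
#[derive(Clone, Debug, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

// Simplified per-column statistics (real DataFusion uses ScalarValue, not i64).
#[derive(Clone, Debug)]
struct ColumnStatistics {
    min_value: Precision<i64>,
    max_value: Precision<i64>,
}

/// Returns Some((min, max)) only when BOTH bounds are known exactly.
/// Inexact or absent statistics yield no bound: pruning with an
/// under-estimated range could incorrectly skip matching files.
fn exact_bounds(stats: &ColumnStatistics) -> Option<(i64, i64)> {
    match (&stats.min_value, &stats.max_value) {
        (Precision::Exact(min), Precision::Exact(max)) => Some((*min, *max)),
        _ => None,
    }
}

fn main() {
    let exact = ColumnStatistics {
        min_value: Precision::Exact(10),
        max_value: Precision::Exact(42),
    };
    assert_eq!(exact_bounds(&exact), Some((10, 42)));

    // An inexact min disqualifies the whole column bound.
    let inexact = ColumnStatistics {
        min_value: Precision::Inexact(10),
        max_value: Precision::Exact(42),
    };
    assert_eq!(exact_bounds(&inexact), None);

    // Same for a missing bound.
    let absent = ColumnStatistics {
        min_value: Precision::Absent,
        max_value: Precision::Exact(42),
    };
    assert_eq!(exact_bounds(&absent), None);
}
```

Because exact file-level min/max can only be wider than the true data bounds, a filter built from them is conservative: it may admit extra files, but never drops a file that could contain a join match.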
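The null-check guards described for `create_bounds_predicate` can likewise be sketched in a simplified form. This hypothetical version renders SQL-like strings for illustration (the real implementation builds physical expressions, not strings); the point is that columns without bounds are skipped rather than invalidating the whole filter:

```rust
/// Build a conjunctive range predicate from per-column bounds, skipping
/// columns whose bounds are unavailable (the "null-check guard" behaviour).
/// Returns None when no column has usable bounds.
fn bounds_predicate(bounds: &[(&str, Option<(i64, i64)>)]) -> Option<String> {
    let clauses: Vec<String> = bounds
        .iter()
        .filter_map(|(col, b)| {
            // Columns with missing (null) min/max contribute no clause,
            // so partial statistics still yield a usable, weaker predicate.
            b.map(|(min, max)| format!("{col} >= {min} AND {col} <= {max}"))
        })
        .collect();
    if clauses.is_empty() {
        None
    } else {
        Some(clauses.join(" AND "))
    }
}

fn main() {
    // Partial bounds: only column `a` has statistics.
    let pred = bounds_predicate(&[("a", Some((1, 5))), ("b", None)]);
    assert_eq!(pred.as_deref(), Some("a >= 1 AND a <= 5"));

    // No usable statistics at all: no filter is produced.
    assert_eq!(bounds_predicate(&[("a", None)]), None);
}
```

Dropping a clause only weakens the predicate, which is safe for pruning; emitting a clause with a null bound would be the dangerous direction.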
Integration tests in `dynamic_filter_file_stats.slt` verify correctness and file pruning with Parquet file statistics.

## Are these changes tested?

Yes. The PR includes:

- Unit tests in `shared_bounds.rs` covering various scenarios (single/multi-column, absent stats, null values, partial bounds, inexact statistics)
- Integration tests in `dynamic_filter_file_stats.slt` verifying correctness with Parquet files and confirming that file-level pruning occurs (e.g., 3 files → 1 matched via dynamic filter bounds)

## Are there any user-facing changes?

This is an internal optimization that improves the performance of hash joins with dynamic filter pushdown when the build side is a file scan. There are no API changes or user-facing configuration changes; the optimization is transparent and automatically applied when dynamic filter pushdown is enabled.

https://claude.ai/code/session_019qN7na2BoTBDSy63HZ6u4K

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
