Dandandan opened a new pull request, #20686:
URL: https://github.com/apache/datafusion/pull/20686

   ## Which issue does this PR close?
   
   - Closes #.
   
   ## Rationale for this change
   
   Dynamic filter pushdown is a powerful optimization that prunes probe-side 
files/row groups based on the actual values in the build side. However, the 
current implementation only provides bounds after the entire build side is 
consumed. For large build sides, this delays pruning decisions.
   
   This PR enables early, approximate dynamic filtering by extracting min/max 
bounds from the build side's file-level statistics (e.g., Parquet file 
metadata) before the build side is fully consumed. These bounds are typically 
wider than the actual data bounds but are available immediately, allowing the 
probe side to start pruning files/row groups earlier. The filter is later 
refined with exact bounds when the build side is fully consumed.
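   The safety argument behind this two-phase approach can be sketched as follows: pruning against wider bounds is always conservative, because any probe-side file that overlaps the exact bounds necessarily overlaps the wider file-statistics bounds too. The bound values below are illustrative, not taken from the PR.

   ```rust
   // Conceptual sketch of the two-phase dynamic filter: start with wide
   // bounds from build-side file metadata, then tighten to exact bounds
   // once the build side is fully consumed.

   /// A probe-side file can be skipped if its [min, max] range does not
   /// overlap the current filter bounds.
   fn prunes_file(file_min: i64, file_max: i64, bound_min: i64, bound_max: i64) -> bool {
       file_max < bound_min || file_min > bound_max
   }

   fn main() {
       // Phase 1: approximate bounds from build-side file statistics (wide).
       let (wide_min, wide_max) = (0, 100);
       // Phase 2: exact bounds after the build side is fully consumed (tight).
       let (exact_min, exact_max) = (40, 60);

       // A file entirely outside even the wide bounds is pruned immediately.
       assert!(prunes_file(150, 200, wide_min, wide_max));
       // A file inside the wide bounds survives phase 1...
       assert!(!prunes_file(10, 20, wide_min, wide_max));
       // ...but can still be pruned once the exact bounds arrive.
       assert!(prunes_file(10, 20, exact_min, exact_max));
       println!("ok");
   }
   ```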
   
   ## What changes are included in this PR?
   
   1. **New function `compute_bounds_from_statistics`**: Extracts min/max 
bounds from execution plan statistics for simple column reference expressions. 
Only uses `Precision::Exact` statistics to avoid over-pruning with inexact 
bounds.
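   
   A minimal sketch of the exact-only rule, using simplified stand-in types rather than DataFusion's real `Statistics`/`Precision` (the struct and function names here are illustrative assumptions):

   ```rust
   // Simplified stand-in for DataFusion's Precision wrapper around
   // column statistics values.
   #[derive(Clone, Copy, Debug, PartialEq)]
   enum Precision<T> {
       Exact(T),
       Inexact(T),
       Absent,
   }

   #[derive(Clone, Copy, Debug)]
   struct ColumnStats {
       min: Precision<i64>,
       max: Precision<i64>,
   }

   /// Return (min, max) bounds only when BOTH sides are exact; inexact or
   /// absent statistics yield `None`, so the filter never over-prunes.
   fn bounds_from_stats(stats: &ColumnStats) -> Option<(i64, i64)> {
       match (stats.min, stats.max) {
           (Precision::Exact(lo), Precision::Exact(hi)) => Some((lo, hi)),
           _ => None,
       }
   }

   fn main() {
       let exact = ColumnStats { min: Precision::Exact(3), max: Precision::Exact(42) };
       let inexact = ColumnStats { min: Precision::Inexact(0), max: Precision::Exact(42) };
       let absent = ColumnStats { min: Precision::Absent, max: Precision::Absent };
       assert_eq!(bounds_from_stats(&exact), Some((3, 42)));
       assert_eq!(bounds_from_stats(&inexact), None);
       assert_eq!(bounds_from_stats(&absent), None);
       println!("ok");
   }
   ```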
   
   2. **New method `update_from_build_side_statistics`**: Updates the dynamic 
filter with initial bounds derived from build-side file statistics. This 
provides an early, approximate filter before the build side is fully consumed.
   
   3. **Enhanced `create_bounds_predicate`**: Added null-check guards to skip 
column bounds with null min/max values, allowing partial bounds (some columns 
with stats, others without) to be handled correctly.
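   
   The skip-on-null behavior can be sketched like this (simplified stand-in types, not DataFusion's actual `create_bounds_predicate` signature; the predicate is rendered as a string purely for illustration):

   ```rust
   // Per-column bounds where `None` models a null/absent min or max.
   struct ColumnBounds {
       column: String,
       min: Option<i64>,
       max: Option<i64>,
   }

   /// Build a `col >= min AND col <= max AND ...` conjunction, skipping
   /// any column whose bounds are null so partial bounds still produce a
   /// usable (if weaker) predicate.
   fn bounds_predicate(bounds: &[ColumnBounds]) -> Option<String> {
       let parts: Vec<String> = bounds
           .iter()
           .filter_map(|b| match (b.min, b.max) {
               (Some(lo), Some(hi)) => {
                   Some(format!("{} >= {} AND {} <= {}", b.column, lo, b.column, hi))
               }
               // Null min/max: skip this column rather than emit a bad guard.
               _ => None,
           })
           .collect();
       if parts.is_empty() { None } else { Some(parts.join(" AND ")) }
   }

   fn main() {
       let bounds = vec![
           ColumnBounds { column: "a".into(), min: Some(1), max: Some(10) },
           ColumnBounds { column: "b".into(), min: None, max: Some(5) }, // skipped
       ];
       assert_eq!(bounds_predicate(&bounds).as_deref(), Some("a >= 1 AND a <= 10"));
       assert_eq!(bounds_predicate(&[]), None);
       println!("ok");
   }
   ```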
   
   4. **Integration in `HashJoinExec`**: Calls 
`update_from_build_side_statistics` during the build accumulator initialization 
to provide early bounds from the build side's partition statistics.
   
   5. **Comprehensive test coverage**: Added unit tests for 
`compute_bounds_from_statistics` covering single/multi-column cases, 
absent/null statistics, partial bounds, and inexact statistics. Added 
integration tests in `dynamic_filter_file_stats.slt` verifying correctness and 
file pruning with Parquet file statistics.
   
   ## Are these changes tested?
   
   Yes. The PR includes:
   - Unit tests in `shared_bounds.rs` covering various scenarios 
(single/multi-column, absent stats, null values, partial bounds, inexact 
statistics)
   - Integration tests in `dynamic_filter_file_stats.slt` verifying correctness 
with Parquet files and confirming file-level pruning occurs (e.g., 3 files → 1 
matched via dynamic filter bounds)
   
   ## Are there any user-facing changes?
   
   This is an internal optimization that improves performance of hash joins 
with dynamic filter pushdown when the build side is a file scan. No API changes 
or user-facing configuration changes are required. The optimization is 
transparent and automatically applied when dynamic filter pushdown is enabled.
   
   https://claude.ai/code/session_019qN7na2BoTBDSy63HZ6u4K


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

