adriangb opened a new pull request, #20092: URL: https://github.com/apache/datafusion/pull/20092
## Which issue does this PR close?

Related to:

- https://github.com/apache/datafusion/issues/19894 - Unified `TableScan.filters` representation
- https://github.com/apache/datafusion/issues/19950 - `UPDATE ...FROM` bug (filter extraction improvements)
- https://github.com/apache/datafusion/pull/20091 - Complementary work on consolidated TableScan representation (projections as expressions)

## Rationale

Currently, the optimizer calls `supports_filters_pushdown()` to classify filters during logical optimization. This results in a **split representation** where:

- Exact/Inexact filters go to `TableScan.filters`
- Unsupported/Inexact/Volatile filters stay as `Filter` nodes above the scan

This creates several problems (as described in #19894):

- **Filter duplication risk**: The same predicate may exist in both a `Filter` node and `TableScan.filters`
- **Semantic confusion**: Unclear which filters are "pushed down" vs. "logical"
- **Implementation burden**: DML operations must collect filters from multiple locations
- **Multi-table safety hazards**: UPDATE...FROM scenarios become fragile

## What changes are included in this PR?

This PR moves ALL filter expressions to `TableScan.filters` during logical optimization, deferring classification (Exact/Inexact/Unsupported) to the physical planner.
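To make the deferred classification concrete, here is a minimal, self-contained sketch of how a planner could split the filters carried by a scan once the provider's answer is known. It is not the code in this PR: `PushDown` is a simplified stand-in for DataFusion's `TableProviderFilterPushDown`, and `ScanFilter`, `PlannedFilters`, and `plan_filters` are hypothetical names used only for illustration. It also assumes that volatile predicates are evaluated only after the scan.

```rust
/// Simplified stand-in for DataFusion's `TableProviderFilterPushDown`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PushDown {
    Exact,
    Inexact,
    Unsupported,
}

/// Hypothetical filter model: predicate text plus a volatility flag.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ScanFilter {
    pub expr: String,
    pub volatile: bool,
}

/// Hypothetical result of classifying every filter attached to a scan.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct PlannedFilters {
    /// Predicates handed to the table provider.
    pub pushed_to_scan: Vec<String>,
    /// Predicates that still need a post-scan filter (a `FilterExec` in the PR).
    pub post_scan: Vec<String>,
}

/// Split filters using a provider-supplied classifier.
/// Assumption: volatile predicates are never pushed into the scan.
pub fn plan_filters(
    filters: &[ScanFilter],
    classify: impl Fn(&ScanFilter) -> PushDown,
) -> PlannedFilters {
    let mut planned = PlannedFilters::default();
    for filter in filters {
        if filter.volatile {
            planned.post_scan.push(filter.expr.clone());
            continue;
        }
        match classify(filter) {
            // The provider guarantees the predicate; no re-check is needed.
            PushDown::Exact => planned.pushed_to_scan.push(filter.expr.clone()),
            // The provider may return extra rows; push down and re-check.
            PushDown::Inexact => {
                planned.pushed_to_scan.push(filter.expr.clone());
                planned.post_scan.push(filter.expr.clone());
            }
            // The provider cannot help; evaluate only after the scan.
            PushDown::Unsupported => planned.post_scan.push(filter.expr.clone()),
        }
    }
    planned
}
```

In this model an Inexact predicate lands in both lists, which mirrors why Inexact appears on both sides of the split representation described in the rationale.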
### Changes to `push_down_filter.rs`:

- Simplified TableScan case to push ALL filters (except scalar subqueries) to `TableScan.filters`
- Removed filter classification logic (now handled by physical planner)

### Changes to `physical_planner.rs`:

- Enhanced TableScan handler to:
  - Classify filters using `supports_filters_pushdown()`
  - Create `FilterExec` for Unsupported/Inexact/Volatile filters
  - Handle projection expansion when filters need columns not in user's projection
  - Apply limits correctly when post-filtering is needed
- Added `compute_scan_projection_with_filters()` helper
- Added `create_filter_exec()` helper with async UDF support
- Updated `extract_dml_filters()` to also extract from `TableScan.filters`

### Behavior Changes:

1. **Logical Plan**: All filters (except scalar subqueries) now appear in `TableScan.filters` instead of as separate `Filter` nodes
2. **Physical Plan**: The physical planner creates `FilterExec` nodes for Unsupported/Inexact/Volatile filters
3. **Projection Handling**: When post-scan filters need columns not in the user's projection, we expand the scan projection and add a final `ProjectionExec` to trim extra columns

## Are these changes tested?

Yes - updated existing tests to match new plan representations:

- Optimizer tests (snapshot updates)
- Physical planner tests
- Core integration tests
- Dataframe and view tests

## Are there any user-facing changes?

**Plan output changes**: Users will see filters in `TableScan` with `partial_filters=` or `unsupported_filters=` annotations in logical plans, rather than separate `Filter:` nodes. Physical plans remain functionally equivalent with `FilterExec` nodes where needed.

---

🤖 Generated with [Claude Code](https://claude.ai/code)
