kosiew opened a new issue, #19894:
URL: https://github.com/apache/datafusion/issues/19894

   ## Problem Statement
   
   The current DML filter extraction implementation in DataFusion exhibits a **fundamental architectural inconsistency**: the predicates that apply to a `TableScan` are tracked in two distinct ways:
   
   1. **Filters passed to TableProvider** - via DML methods such as `delete_from()` and `update()`
   2. **Filters in LogicalPlan** - via the `TableScan.filters` field during planning
   
   This dual-track approach creates several problems:
   
   ### Current Issues
   
   1. **Filter Duplication Risk**
      - Same predicate may exist in both `Filter` node and `TableScan.filters`
      - Deduplication logic becomes complex and error-prone
      - Different code paths may process the same filter inconsistently
   
   2. **Semantic Confusion**
      - Unclear which filters are "pushed down" vs. "logical"
      - Makes it difficult to reason about query semantics
      - Complicates optimizer correctness proofs
   
   3. **Implementation Burden**
      - DML operations must collect filters from multiple locations
      - Qualifier-stripping and validation needed at collection time
      - Each new operator/optimizer rule must handle both locations
   
   4. **Multi-Table Safety Hazards**
      - UPDATE...FROM scenarios require careful tracking of which table each filter belongs to
      - Target table scoping becomes a bandage rather than a systemic solution
      - Cross-table predicate contamination is possible if new patterns emerge
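
   To make the collection burden concrete, here is a minimal sketch of what extraction has to do today. It is not DataFusion code: `Predicate`, `DualTrackPlan`, and `collect_dml_filters` are hypothetical stand-ins, and predicates are modeled as a table qualifier plus unqualified text rather than as real expressions.

   ```rust
   use std::collections::HashSet;

   /// Hypothetical stand-in for a predicate: the table it references plus
   /// the predicate text with the qualifier already stripped.
   #[derive(Debug, Clone, PartialEq, Eq, Hash)]
   struct Predicate {
       table: String,
       text: String,
   }

   /// Hypothetical plan shape: a Filter node above a scan that also
   /// carries its own filters (the dual-track layout described above).
   struct DualTrackPlan {
       filter_node: Vec<Predicate>,
       scan_filters: Vec<Predicate>,
   }

   /// Collect the target table's predicates from both locations,
   /// skipping other tables' predicates and dropping duplicates.
   fn collect_dml_filters(plan: &DualTrackPlan, target: &str) -> Vec<String> {
       let mut seen = HashSet::new();
       let mut out = Vec::new();
       for p in plan.filter_node.iter().chain(plan.scan_filters.iter()) {
           // Target scoping: a predicate on another table must not leak in.
           if p.table != target {
               continue;
           }
           // Deduplication: the same predicate may sit in both locations.
           if seen.insert(p.text.clone()) {
               out.push(p.text.clone());
           }
       }
       out
   }
   ```

   Every step in the loop (visiting two locations, scoping to the target, deduplicating) is work that exists only because the same predicate can live in more than one place.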
   
   ### Example Scenario (UPDATE...FROM)
   
   ```sql
   UPDATE target SET col = val 
   FROM source 
   WHERE target.id = source.id AND target.status = 'active'
   ```
   
   **Current fragility:**
   - Filters could appear in: Filter node, target TableScan.filters, or source 
TableScan.filters
   - DML must track: which filters belong to target vs. source
   - Risk: Source filter accidentally applied to target during extraction
   
   **With unified design:**
   - Single, clear filter representation
   - Optimizer ensures filters on each table are stored consistently
   - DML extraction becomes straightforward and safe
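
   As an illustration of how the two conjuncts in the query above might be placed under a unified design, the sketch below classifies each conjunct by the set of tables it references. The types and the `place` function are hypothetical and only model the idea; the table sets are supplied by hand rather than derived from a real expression tree.

   ```rust
   use std::collections::BTreeSet;

   /// Hypothetical conjunct: predicate text plus the tables it references.
   struct Conjunct {
       text: &'static str,
       tables: BTreeSet<&'static str>,
   }

   /// Where a conjunct would live in this sketch.
   #[derive(Debug, PartialEq)]
   enum Placement {
       /// References exactly one table: stored in that table's scan filters.
       Scan(&'static str),
       /// References zero or several tables: stays in the Filter node.
       FilterNode,
   }

   fn place(c: &Conjunct) -> Placement {
       if c.tables.len() == 1 {
           Placement::Scan(*c.tables.iter().next().unwrap())
       } else {
           Placement::FilterNode
       }
   }

   /// The two conjuncts of the WHERE clause from the UPDATE...FROM example.
   fn example_placements() -> Vec<(&'static str, Placement)> {
       let conjuncts = vec![
           Conjunct {
               text: "target.id = source.id",
               tables: BTreeSet::from(["target", "source"]),
           },
           Conjunct {
               text: "target.status = 'active'",
               tables: BTreeSet::from(["target"]),
           },
       ];
       conjuncts.iter().map(|c| (c.text, place(c))).collect()
   }
   ```

   In this model the join condition stays in the Filter node, while `target.status = 'active'` is the only predicate a DML operation on `target` would need to read from the target scan.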
   
   ---
   
   ## Proposed Solution: Unified Filter Field
   
   ### Design Goals
   
   1. **Single Source of Truth** - One representation of filters for each table
   2. **Explicit Semantics** - Clear distinction between logical and 
pushed-down filters
   3. **Safety** - Impossible for filters to be duplicated or cross-contaminated
   4. **Performance** - No additional overhead vs. current approach
   
   ### Approach
   
   Introduce a **unified filter contract** where:
   
   1. **TableScan Becomes the Filter Container**
      - `TableScan.filters` becomes the *only* place where table predicates 
live during planning
      - Filter nodes are removed/consolidated during logical planning (not 
pushed down later)
   
   2. **Filter Node Redesign**
      - Retain the `Filter` node for predicates that can't be expressed in `TableScan`
      - Examples: complex expressions, cross-table join predicates, subqueries
      - Simple table-local predicates live in TableScan.filters
   
   3. **Optimizer Clarity**
      - "Push down" means: move predicate from Filter into TableScan.filters
      - "Pull up" means: move predicate from TableScan.filters into Filter 
(rare)
      - Symmetry and clarity in optimizer rules
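
   The push-down and pull-up definitions above can be sketched as two symmetric moves over a toy plan. This is an assumption-laden model, not DataFusion's optimizer: `UnifiedPlan`, `push_down`, `pull_up`, and `move_pred` are hypothetical names, and predicates are plain strings.

   ```rust
   /// Hypothetical plan with one Filter node above one scan.
   #[derive(Debug, Default)]
   struct UnifiedPlan {
       filter_node: Vec<String>,
       scan_filters: Vec<String>,
   }

   impl UnifiedPlan {
       /// "Push down": move a predicate from the Filter node into the scan.
       /// Returns false if the predicate is not in the Filter node.
       fn push_down(&mut self, pred: &str) -> bool {
           move_pred(&mut self.filter_node, &mut self.scan_filters, pred)
       }

       /// "Pull up": move a predicate from the scan back into the Filter node.
       /// Returns false if the predicate is not in the scan filters.
       fn pull_up(&mut self, pred: &str) -> bool {
           move_pred(&mut self.scan_filters, &mut self.filter_node, pred)
       }
   }

   /// Move the first occurrence of `pred` from one list to the other, so a
   /// predicate is never present in both places at once.
   fn move_pred(from: &mut Vec<String>, to: &mut Vec<String>, pred: &str) -> bool {
       match from.iter().position(|p| p.as_str() == pred) {
           Some(i) => {
               to.push(from.remove(i));
               true
           }
           None => false,
       }
   }
   ```

   Because both operations are a move rather than a copy, a predicate that starts in one location cannot end up duplicated across the two, which is the structural guarantee the proposal is after.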
   
   ### Benefits
   
   | Benefit | Current | Unified | Impact |
   |---------|---------|---------|--------|
   | Filter extraction | Multiple locations | Single location | Simpler DML logic |
   | Cross-table safety | Complex tracking | Scoped by design | Safer UPDATE...FROM |
   | Deduplication | Needed at runtime | Unnecessary (duplicates structurally impossible) | Fewer bugs |
   | Optimizer rules | Must handle both | Single representation | Cleaner code |
   | Reasoning about plans | Unclear semantics | Explicit semantics | Better debugging |
   
   ---
   
   @adriangb's [suggestion in #19884](https://github.com/apache/datafusion/pull/19884#issuecomment-3769536637)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

