Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

via GitHub Tue, 24 Jun 2025 00:25:57 -0700


kosiew commented on PR #16461:
URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2999138805


   Adding notes for future reference:
   
   ---
   
   # Summary: Adapting Filter Expressions to File Schema During Parquet Scan
   
   ---
   
   ## Background & Goal
   
   - Apache DataFusion wants to improve how filter expressions (predicates) and 
projections are adapted to the **physical file schema** during Parquet scans.
   - The effort aims to:
     - Move closer to handling nested struct schema evolution (#15780).
     - Replace older `SchemaAdapter` machinery with a new builder-based 
approach.
     - Support expression rewrites for projection and selection pushdown 
(#14993, #15057).
     - Make it easier to work with schema evolution on nested structs (#15821).
     - Enable simpler hooks for handling missing columns and expression 
transformations.
   
   ---
   
   ## Key Concepts
   
   ### 1. Expression Rewriting (Pushdown Adaptation)
   
   - Rewrites filter and projection expressions to align with the **file’s 
physical schema**.
   - Examples:
     - If an expression refers to a nested field `foo.baz` that is missing on 
disk → rewrite to `lit(NULL)`.
     - If a field has different physical type on disk vs. logical schema → add 
casts.
   - This rewriting ensures that predicate pushdown logic and filters do not 
error out when the *on-disk* schema differs from the *logical* schema.
   - Expression rewriting happens **before** reading data and uses the physical 
schema to safely prune row groups.
   
   ### 2. Data Adaptation (Batch-Level Reshaping)
   
   - After reading a `RecordBatch` or arrays from Parquet, reshape them to 
match the **logical table schema**.
   - Actions include:
     - Adding null arrays for missing nested fields (nested struct imputation).
     - Dropping columns no longer part of the logical schema.
     - Recursively casting nested struct types to match the logical type.
   - This ensures downstream operators receive data shaped exactly as expected 
in the query, despite schema evolution.
   
   ---
   
   ## Main Discussion Points
   
   | Topic                                                      | Details       
                                                                                
            |
   | ---------------------------------------------------------- | 
---------------------------------------------------------------------------------------------------------
 |
   | **Proposed Approach**                                      | Introduce a 
`PhysicalExprSchemaRewriter` builder to adapt expressions to file schema during 
pruning/scanning. |
   | **Nested Struct Imputation**                               | 
Expression-only rewrites are limited for nested structs since they do not 
modify the actual data arrays.  |
   | **Data vs. Expression Adaptation**                         | Expression 
rewriting is great for pushdown but batch-level adapters are needed for 
correct, shaped data.  |
   | **Complementary Approach**                                 | Use 
expression rewriting for filters/projections + array-level adapters (e.g. 
`cast_struct_column`) to reshape in-memory data. |
   | **Projection Pushdown Scenario**                           | A scan can 
receive full projection expressions (e.g. `a + b`), which get adapted and 
evaluated on RecordBatch, producing final output. |
   | **Risks of Expression-Only Rewrites**                      | - No effect 
on RecordBatch structure.<br>- Limited scope (only predicates and 
pruning).<br>- Risk of code duplication.<br>- Complex handling for deeply 
nested types.<br>- Possibly poorer performance due to repeated expression 
rewrites. |
   | **Potential Benefits of Expression-Based Rewrites**       | Cleaner 
pruning path, simpler code, no fake batches for evaluation, reusable visitor 
pattern.            |
   
   ---
   
   ## Diagram: Data Adaptation vs. Expression Rewriting Flow
   
   ```text
   +------------------+                          +------------------+           
               +--------------------+
   | Query Logical     |                          | File Physical    |          
                | In-Memory Batch     |
   | Schema           |                          | Schema          |            
              | (RecordBatch)       |
   | (e.g. table)      |                          | (Parquet file)   |          
                |                     |
   +------------------+                          +------------------+           
               +--------------------+
             |                                           |                      
                     |
             |                                           |                      
                     |
             |                    Expression Rewriting  |                       
                    |
             | <------------------- adapts --------------                       
                   |
             |                                           |                      
                     |
             v                                           |                      
                     |
     Filter / Projection Expressions                      |                     
                      |
    (col("foo.b") > 5)                                   |                      
                     |
             |                                           |                      
                     |
             |                      Reads columns from file projected           
              |
             | ---------------------------------------------------►  Parquet 
File                         |
             |                                           |                      
                     |
             |                                           |                      
                     |
             |               Raw RecordBatch w/ file schema  (missing nested 
fields, etc.)           |
             | <--------------------------------------------------              
                   |
             |                                           |                      
                     |
             v                       Data Adaptation (cast_struct_column, 
adapters)                  |
     Final RecordBatch shaped to  | logical schema with missing fields filled   
                    |
     logical schema (including     | (null arrays for nested missing cols)      
                       |
     nested structs corrected)     |                                           |
             |                                           |                      
                     |
             v                                           v                      
                     v
   --- Query execution continues using clean, corrected data 
----------------------------------------
   ```
   
   ---
   
   ## Takeaways
   
   - **Expression rewriting and data adaptation are complementary**:
     - Expression rewriting is for making predicate filtering safe and 
efficient at the file level.
     - Data adaptation ensures the actual data arrays match the logical schema 
exactly for query execution.
   - For nested structs, **array-level reshaping remains essential** to create 
proper null arrays for missing nested fields.
   - The current proposal introduces a reusable expression rewriting builder 
simplifying filter pushdown and helping unblock further features.
   - The discussion encourages a **two-pronged approach**:
     1. Refactor pruning/filter handling to use expression rewriter.
     2. Maintain and improve array-level schema adapters for final batch 
shaping.
   - Projection pushdown integration involves the scan receiving full 
projection expressions, rewriting them, reading the required columns, and then 
evaluating the expressions over the batch.
   
   ---
   
   ## Future Directions
   
   - Incorporate projection pushdown (#14993) fully with expression rewriting 
and data adapters.
   - Explore further optimizations to eliminate expensive casts (#16530).
   - Unify logical and physical schema adaptation logic where feasible (#16528).
   - Extend hooks in the new builder for handling missing columns and recursive 
nested struct casts.
     
   ---
   
   # Summary
   
   The issue centers on improving how DataFusion adapts filter expressions and 
projections during Parquet scans when the on-disk file schema differs from the 
logical query schema — especially for nested structs. The key is to separate 
expression rewriting (for pushdown safety) from actual batch data adaptation 
(for correctness), using a new builder abstraction for expression rewrites. 
This approach unblocks further optimizations and schema evolution support while 
keeping code maintainable and extensible.
   
   ---
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

Reply via email to