kosiew commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2999138805
Adding notes for future reference: --- # Summary: Adapting Filter Expressions to File Schema During Parquet Scan --- ## Background & Goal - Apache DataFusion wants to improve how filter expressions (predicates) and projections are adapted to the **physical file schema** during Parquet scans. - The effort aims to: - Move closer to handling nested struct schema evolution (#15780). - Replace older `SchemaAdapter` machinery with a new builder-based approach. - Support expression rewrites for projection and selection pushdown (#14993, #15057). - Make it easier to work with schema evolution on nested structs (#15821). - Enable simpler hooks for handling missing columns and expression transformations. --- ## Key Concepts ### 1. Expression Rewriting (Pushdown Adaptation) - Rewrites filter and projection expressions to align with the **file’s physical schema**. - Examples: - If an expression refers to a nested field `foo.baz` that is missing on disk → rewrite to `lit(NULL)`. - If a field has different physical type on disk vs. logical schema → add casts. - This rewriting ensures that predicate pushdown logic and filters do not error out when the *on-disk* schema differs from the *logical* schema. - Expression rewriting happens **before** reading data and uses the physical schema to safely prune row groups. ### 2. Data Adaptation (Batch-Level Reshaping) - After reading a `RecordBatch` or arrays from Parquet, reshape them to match the **logical table schema**. - Actions include: - Adding null arrays for missing nested fields (nested struct imputation). - Dropping columns no longer part of the logical schema. - Recursively casting nested struct types to match the logical type. - This ensures downstream operators receive data shaped exactly as expected in the query, despite schema evolution. --- ## Main Discussion Points | Topic | Details | | ---------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- | | **Proposed Approach** | Introduce a `PhysicalExprSchemaRewriter` builder to adapt expressions to file schema during pruning/scanning. | | **Nested Struct Imputation** | Expression-only rewrites are limited for nested structs since they do not modify the actual data arrays. | | **Data vs. Expression Adaptation** | Expression rewriting is great for pushdown but batch-level adapters are needed for correct, shaped data. | | **Complementary Approach** | Use expression rewriting for filters/projections + array-level adapters (e.g. `cast_struct_column`) to reshape in-memory data. | | **Projection Pushdown Scenario** | A scan can receive full projection expressions (e.g. `a + b`), which get adapted and evaluated on RecordBatch, producing final output. | | **Risks of Expression-Only Rewrites** | - No effect on RecordBatch structure.<br>- Limited scope (only predicates and pruning).<br>- Risk of code duplication.<br>- Complex handling for deeply nested types.<br>- Possibly poorer performance due to repeated expression rewrites. | | **Potential Benefits of Expression-Based Rewrites** | Cleaner pruning path, simpler code, no fake batches for evaluation, reusable visitor pattern. | --- ## Diagram: Data Adaptation vs. Expression Rewriting Flow ```text +------------------+ +------------------+ +--------------------+ | Query Logical | | File Physical | | In-Memory Batch | | Schema | | Schema | | (RecordBatch) | | (e.g. table) | | (Parquet file) | | | +------------------+ +------------------+ +--------------------+ | | | | | | | Expression Rewriting | | | <------------------- adapts -------------- | | | | v | | Filter / Projection Expressions | | (col("foo.b") > 5) | | | | | | Reads columns from file projected | | ---------------------------------------------------► Parquet File | | | | | | | | Raw RecordBatch w/ file schema (missing nested fields, etc.) | | <-------------------------------------------------- | | | | v Data Adaptation (cast_struct_column, adapters) | Final RecordBatch shaped to | logical schema with missing fields filled | logical schema (including | (null arrays for nested missing cols) | nested structs corrected) | | | | | v v v --- Query execution continues using clean, corrected data ---------------------------------------- ``` --- ## Takeaways - **Expression rewriting and data adaptation are complementary**: - Expression rewriting is for making predicate filtering safe and efficient at the file level. - Data adaptation ensures the actual data arrays match the logical schema exactly for query execution. - For nested structs, **array-level reshaping remains essential** to create proper null arrays for missing nested fields. - The current proposal introduces a reusable expression rewriting builder simplifying filter pushdown and helping unblock further features. - The discussion encourages a **two-pronged approach**: 1. Refactor pruning/filter handling to use expression rewriter. 2. Maintain and improve array-level schema adapters for final batch shaping. - Projection pushdown integration involves the scan receiving full projection expressions, rewriting them, reading the required columns, and then evaluating the expressions over the batch. --- ## Future Directions - Incorporate projection pushdown (#14993) fully with expression rewriting and data adapters. - Explore further optimizations to eliminate expensive casts (#16530). - Unify logical and physical schema adaptation logic where feasible (#16528). - Extend hooks in the new builder for handling missing columns and recursive nested struct casts. --- # Summary The issue centers on improving how DataFusion adapts filter expressions and projections during Parquet scans when the on-disk file schema differs from the logical query schema — especially for nested structs. The key is to separate expression rewriting (for pushdown safety) from actual batch data adaptation (for correctness), using a new builder abstraction for expression rewrites. This approach unblocks further optimizations and schema evolution support while keeping code maintainable and extensible. --- -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org