kosiew opened a new pull request, #20038:
URL: https://github.com/apache/datafusion/pull/20038

   ## Which issue does this PR close?
   
   * Closes #17330.
   
   ---
   
   ## Rationale for this change
   
   The existing `PhysicalExprAdapter` and casting infrastructure relied 
primarily on `CastExpr` with Arrow `CastOptions<'static>`, which imposed 
several limitations:
   
   * It required `'static` string lifetimes for format options, making it 
unsafe or impractical to construct cast options dynamically (e.g. from SQL, 
protobuf, or IPC).
   * Struct-aware casting and nullability validation were fragmented across 
multiple call sites, leading to subtle correctness issues (especially around 
nullable → non-nullable casts).
   * The adapter produced generic `CastExpr` nodes even when column-aware 
semantics were required, complicating optimization, equivalence reasoning, 
interval analysis, and pruning.
   
   This PR addresses these issues by fully integrating `CastColumnExpr` into 
the physical planning pipeline, introducing owned cast/format options, and 
tightening schema- and nullability-aware validation across DataFusion.
   
   ---
   
   ## What changes are included in this PR?
   
   ### 1. Owned cast and format options
   
   * Introduces `OwnedFormatOptions` and `OwnedCastOptions` in 
`datafusion-common`.
   * Eliminates the need for `FormatOptions<'static>`, so callers no longer have 
to leak memory or intern strings to obtain `'static` lifetimes.
   * Provides safe, ephemeral conversion to Arrow `CastOptions<'_>` for 
execution.
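
The owned-versus-borrowed split described above can be sketched as follows. This is a minimal illustration only: every type here (`BorrowedFormatOptions`, `BorrowedCastOptions`, `OwnedFormatOptionsSketch`, `OwnedCastOptionsSketch`) and the helper `options_from_runtime` are hypothetical stand-ins, not the actual Arrow or DataFusion definitions, and the real types carry more fields.

```rust
// Hypothetical stand-ins for illustration; NOT the Arrow or DataFusion types.

/// Borrowed format options: string fields live only as long as `'a`.
#[derive(Debug, Clone, PartialEq)]
struct BorrowedFormatOptions<'a> {
    null: &'a str,
    date_format: Option<&'a str>,
}

/// Borrowed cast options, shaped like an options type with a lifetime parameter.
#[derive(Debug, Clone, PartialEq)]
struct BorrowedCastOptions<'a> {
    safe: bool,
    format_options: BorrowedFormatOptions<'a>,
}

/// Owned counterpart: holds `String`s, so it can be built from runtime input
/// (SQL text, a decoded protobuf message) without requiring `'static` data.
#[derive(Debug, Clone, PartialEq, Default)]
struct OwnedFormatOptionsSketch {
    null: String,
    date_format: Option<String>,
}

#[derive(Debug, Clone, PartialEq, Default)]
struct OwnedCastOptionsSketch {
    safe: bool,
    format_options: OwnedFormatOptionsSketch,
}

impl OwnedCastOptionsSketch {
    /// Produce a short-lived borrowed view for use at execution time.
    /// The view borrows from `self`, so nothing is leaked or interned.
    fn as_borrowed(&self) -> BorrowedCastOptions<'_> {
        BorrowedCastOptions {
            safe: self.safe,
            format_options: BorrowedFormatOptions {
                null: &self.format_options.null,
                date_format: self.format_options.date_format.as_deref(),
            },
        }
    }
}

/// Build owned options from strings that are only known at runtime.
fn options_from_runtime(
    null: &str,
    date_format: Option<&str>,
    safe: bool,
) -> OwnedCastOptionsSketch {
    OwnedCastOptionsSketch {
        safe,
        format_options: OwnedFormatOptionsSketch {
            null: null.to_string(),
            date_format: date_format.map(str::to_string),
        },
    }
}
```

In this sketch an expression node would store the owned value and call `as_borrowed()` only at the point where the cast kernel is invoked, so the borrowed view never outlives the node.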
   
   ### 2. `CastColumnExpr` integration
   
   * Refactors the `PhysicalExprAdapter` to emit `CastColumnExpr` instead of 
`CastExpr` for column casts.
   * Adds robust validation via `validate_field_compatibility` and 
`validate_struct_compatibility`.
   * Ensures nullable → non-nullable casts are rejected early and consistently.
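
As a rough illustration of the nullability rule above, the sketch below rejects any cast from a nullable source field to a non-nullable target, recursing into struct children. `FieldSketch` and `check_cast_nullability` are hypothetical names, not the signatures of `validate_field_compatibility` or `validate_struct_compatibility`. Treating a struct child that is absent from the source as null-filled is an assumption made for this example.

```rust
/// Hypothetical, simplified field description (not Arrow's `Field`).
#[derive(Debug, Clone, PartialEq)]
struct FieldSketch {
    name: String,
    nullable: bool,
    /// Non-empty only for struct-like fields.
    children: Vec<FieldSketch>,
}

impl FieldSketch {
    fn leaf(name: &str, nullable: bool) -> Self {
        Self { name: name.to_string(), nullable, children: Vec::new() }
    }

    fn structure(name: &str, nullable: bool, children: Vec<FieldSketch>) -> Self {
        Self { name: name.to_string(), nullable, children }
    }
}

/// Check whether `source` may be cast to `target` without dropping nullability.
fn check_cast_nullability(
    source: &FieldSketch,
    target: &FieldSketch,
) -> Result<(), String> {
    // A nullable source may contain nulls that a non-nullable target cannot hold.
    if source.nullable && !target.nullable {
        return Err(format!(
            "cannot cast nullable field '{}' to non-nullable field '{}'",
            source.name, target.name
        ));
    }
    // Struct children are matched by name and checked recursively.
    for target_child in &target.children {
        match source.children.iter().find(|c| c.name == target_child.name) {
            Some(source_child) => check_cast_nullability(source_child, target_child)?,
            // Assumption: a child missing from the source is filled with nulls,
            // which is only valid when the target child is nullable.
            None if target_child.nullable => {}
            None => {
                return Err(format!(
                    "non-nullable field '{}' is missing from source struct '{}'",
                    target_child.name, source.name
                ));
            }
        }
    }
    Ok(())
}
```

Running a check of this kind once, when the expression is constructed, is what lets an invalid cast fail at planning time rather than partway through execution.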
   
   ### 3. Schema rewriter cleanup
   
   * Simplifies and clarifies schema-rewrite logic with helper routines.
   * Correctly handles column index mismatches, reordered schemas, and nested 
structs.
   * Improves error messages and correctness for mismatched physical vs logical 
schemas.
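
The index-mismatch and missing-column cases can be pictured with a small sketch. It resolves a logical column against the physical schema by name rather than by position, and only allows a missing column when the logical field is nullable. `Resolved` and `resolve_column` are hypothetical names introduced for illustration; the actual rewriter operates on Arrow schemas and physical expressions.

```rust
/// Outcome of resolving one logical column against a physical schema.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
enum Resolved {
    /// The column exists in the physical schema at this index.
    Physical(usize),
    /// The column is absent from the physical schema and will be null-filled.
    MissingNullFilled,
}

/// Resolve a logical column by name. Looking the column up by name, rather
/// than reusing the logical index, is what handles reordered schemas.
fn resolve_column(
    name: &str,
    logical_nullable: bool,
    physical_names: &[&str],
) -> Result<Resolved, String> {
    match physical_names.iter().position(|n| *n == name) {
        Some(index) => Ok(Resolved::Physical(index)),
        // A missing column can only be represented as nulls, so the logical
        // field must be nullable for the rewrite to be valid.
        None if logical_nullable => Ok(Resolved::MissingNullFilled),
        None => Err(format!(
            "non-nullable column '{}' is missing from the physical schema",
            name
        )),
    }
}
```

For example, with a physical schema ordered `["b", "a"]`, the logical column `a` resolves to index 1 in this sketch even if it sits at index 0 in the logical schema.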
   
   ### 4. Optimizer and execution support
   
   * Extends the following to recognize and reason about `CastColumnExpr`:
   
     * Equivalence properties
     * Ordering propagation
     * Interval reasoning
     * Cast-unwrapping simplifications
     * Statistics-based pruning
   
   ### 5. Serialization / deserialization
   
   * Adds protobuf support for `CastColumnExpr`, `PhysicalCastOptions`, and 
`FormatOptions`.
   * Maintains backward compatibility with pre-43.0 fields (`safe`, 
`format_options`).
   * Enables distributed and IPC round-tripping of plans containing 
`CastColumnExpr`.
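
One common way to keep older serialized plans readable is to prefer a new nested options message and fall back to the legacy flat fields when it is absent. The sketch below shows that pattern under loudly stated assumptions: `CastOptionsMsg`, `CastColumnNodeMsg`, their field names, and `decode_cast_options` are all hypothetical and do not reflect the actual protobuf schema or the decoding logic in this PR.

```rust
/// Hypothetical decoded form of a nested cast-options message.
#[derive(Debug, Clone, PartialEq, Default)]
struct CastOptionsMsg {
    safe: bool,
    null_string: String,
}

/// Hypothetical decoded expression node carrying both the new nested
/// message and the legacy flat fields.
#[derive(Debug, Clone, PartialEq, Default)]
struct CastColumnNodeMsg {
    /// New-style options; absent in plans written by older versions.
    cast_options: Option<CastOptionsMsg>,
    /// Legacy field, still populated by older writers.
    legacy_safe: bool,
    /// Legacy field, still populated by older writers.
    legacy_null_string: Option<String>,
}

/// Prefer the new nested message; otherwise rebuild options from legacy fields.
fn decode_cast_options(node: &CastColumnNodeMsg) -> CastOptionsMsg {
    match &node.cast_options {
        Some(options) => options.clone(),
        None => CastOptionsMsg {
            safe: node.legacy_safe,
            null_string: node.legacy_null_string.clone().unwrap_or_default(),
        },
    }
}
```

The fallback keeps decoding total: a message from an older writer still yields a usable options value instead of an error.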
   
   ### 6. Nullability correctness fixes
   
   * Updates tests and examples to use nullable logical schemas where 
appropriate.
   * Fixes incorrect assumptions that missing columns are non-nullable.
   
   ---
   
   ## Are these changes tested?
   
   Yes. This PR adds and updates extensive test coverage, including:
   
   * Unit tests for `CastColumnExpr` construction, evaluation, and validation.
   * Tests for nullable vs non-nullable casting behavior.
   * Schema rewrite and adapter behavior tests.
   * Optimizer tests covering ordering, equivalence classes, interval 
reasoning, and cast unwrapping.
   * Protobuf (de)serialization round-trip tests.
   
   Existing tests were also updated where schema nullability assumptions 
changed.
   
   ---
   
   ## Are there any user-facing changes?
   
   * **Behavioral fixes:** Queries involving missing columns, reordered 
schemas, or nested structs now behave correctly with respect to nullability.
   * **Improved correctness:** Invalid nullable → non-nullable casts now fail 
deterministically at planning time.
   * **No SQL syntax changes** are introduced.
   
   There are no intentional breaking API changes, though downstream users 
who match on internal physical expressions may need to handle 
`CastColumnExpr` where the adapter previously produced `CastExpr`.
   
   ---
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed, validated, and tested before submission.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

