TheBuilderJR opened a new issue, #20835:
URL: https://github.com/apache/datafusion/issues/20835

   ### Describe the bug
   
   ## Summary
   
   DataFusion currently supports additive schema evolution reasonably well for 
plain `Struct` columns, but it fails when the evolved struct is nested inside a 
container type such as `List<Struct>`.
   
   This shows up in Parquet scans where the logical schema is newer than some 
of the physical files. If a nested struct inside a list gains a new nullable 
field, DataFusion fails during planning or execution instead of adapting the 
older files by filling the new field with nulls.
   
   ## Version
   
   Observed on DataFusion `52.1.0`.
   
   ## Problem
   
   Given:
   
   - older Parquet files with a field shaped like `List(Struct(...))`
   - newer Parquet files where the struct inside that list has additional 
nullable fields
   - a scan using the latest logical schema across both old and new files
   
   DataFusion fails with an error like:
   
   ```text
   Cannot cast struct field 'messages' from type List(Struct(...old shape...)) to type List(Struct(...new shape...))
   ```
   
   In my case, the concrete drift is:
   
   - old physical files:
     - `inputAsset: Struct(type, token, amount)`
     - `outputAsset: Struct(type, token)`
   - new logical schema:
     - `inputAsset: Struct(type, token, amount, chain)`
     - `outputAsset: Struct(type, token, chain)`
   
   where both `chain` fields are nullable additions.
   
   ## Expected behavior
   
   For additive schema evolution, DataFusion should treat nested container 
cases similarly to plain `Struct` evolution:
   
   - missing fields in older files should be filled with nulls if the target 
field is nullable
   - extra fields in a file (older or newer) that are not present in the 
target schema should be ignored
   - recursive adaptation should work through:
     - `List`
     - `LargeList`
     - `FixedSizeList`
     - `Map`
     - combinations like `Struct -> List(Struct) -> Struct`
   
   This should allow both narrow projections and `SELECT *` across 
schema-drifted Parquet files without application-side rewriting.
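
To make the expected behavior concrete, here is a minimal sketch of the recursive adaptation on plain Python values. It is a toy model, not DataFusion or Arrow code: `Field`, `Struct`, `ListOf`, `MapOf`, and `adapt` are hypothetical names, `ListOf` stands in for `List`, `LargeList`, and `FixedSizeList` alike, `MapOf` adapts only the map's value type, and the sample row values are made up.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Field:
    name: str
    dtype: Any            # a primitive name (str), Struct, ListOf, or MapOf
    nullable: bool = True


@dataclass(frozen=True)
class Struct:
    fields: tuple         # tuple of Field


@dataclass(frozen=True)
class ListOf:             # toy stand-in for List / LargeList / FixedSizeList
    item: Any


@dataclass(frozen=True)
class MapOf:              # only the value type is adapted in this sketch
    value: Any


def adapt(value, source, target):
    """Reshape `value` (laid out as `source`) so that it matches `target`."""
    if value is None:
        return None
    if isinstance(source, Struct) and isinstance(target, Struct):
        by_name = {f.name: f for f in source.fields}
        out = {}
        for tf in target.fields:
            sf = by_name.get(tf.name)
            if sf is not None:
                out[tf.name] = adapt(value.get(tf.name), sf.dtype, tf.dtype)
            elif tf.nullable:
                out[tf.name] = None  # field missing in the old file: fill with null
            else:
                raise TypeError(f"missing non-nullable field {tf.name!r}")
        return out                   # fields that exist only in the source are dropped
    if isinstance(source, ListOf) and isinstance(target, ListOf):
        return [adapt(v, source.item, target.item) for v in value]
    if isinstance(source, MapOf) and isinstance(target, MapOf):
        return {k: adapt(v, source.value, target.value) for k, v in value.items()}
    if source != target:
        raise TypeError(f"cannot adapt {source!r} to {target!r}")
    return value


# Simplified `outputAsset` shapes: the new schema adds a nullable `chain`.
old_asset = Struct((Field("type", "utf8"), Field("token", "utf8")))
new_asset = Struct((Field("type", "utf8"), Field("token", "utf8"), Field("chain", "utf8")))
old_rows = [{"type": "erc20", "token": "USDC"}]
adapted = adapt(old_rows, ListOf(old_asset), ListOf(new_asset))
# adapted == [{"type": "erc20", "token": "USDC", "chain": None}]
```

The point of the sketch is that the struct rule and the container rule are the same function calling itself, so `Struct -> List(Struct) -> Struct` needs no extra case.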
   
   ## Actual behavior
   
   DataFusion succeeds for some plain `Struct` evolution scenarios, but fails 
when the evolved struct is nested in a list or map-like container.
   
   The failure appears during schema rewriting or cast validation for Parquet 
scan expressions.
   
   ## Why this seems like a gap in the current implementation
   
   From reading the current code:
   
   - `DefaultPhysicalExprAdapterRewriter::rewrite_column` special-cases 
`(Struct, Struct)` compatibility and otherwise falls back to generic 
`can_cast_types`
   - `datafusion_common::nested_struct::cast_column` special-cases target 
`Struct` and otherwise falls back to generic Arrow casting
   - as a result, `Struct` evolution gets custom handling, but `List<Struct>` 
does not
   
   So the current behavior looks like:
   
   - supported: `Struct -> Struct` with missing or extra fields
   - not supported: `List<Struct> -> List<Struct>` with additive nested fields
   
   ## Relevant code paths
   
   These are the places that seem most relevant:
   
   - `datafusion-common/src/nested_struct.rs`
     - `cast_column`
     - `validate_struct_compatibility`
   - `datafusion-physical-expr-adapter/src/schema_rewriter.rs`
     - `DefaultPhysicalExprAdapterRewriter::rewrite_column`
   - `datafusion-physical-expr/src/expressions/cast_column.rs`
     - `CastColumnExpr::evaluate`
   
   ## Minimal shape of the repro
   
   Logical schema:
   
   ```text
   data: Struct(
     messages: List(
       Struct(
         kwargs: Struct(
           tool_calls: List(
             Struct(
               args: Struct(
                 swaps: List(
                   Struct(
                     inputAsset: Struct(
                       amount: Struct(type, value),
                       token: Struct(identifier_type, value),
                       type,
                       chain
                     ),
                     outputAsset: Struct(
                       token: Struct(identifier_type, value),
                       type,
                       chain
                     )
                   )
                 )
               )
             )
           )
         )
       )
     )
   )
   ```
   
   Older physical files have the same shape except `inputAsset.chain` and 
`outputAsset.chain` are absent.
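
As a rough aid for confirming the drift, the sketch below rebuilds the logical and older shapes as plain Python values and lists the fields that exist only in the newer one. This is an illustrative model only: structs are dicts, a list type is a one-element Python list, `missing_paths`, `swap`, and `schema` are hypothetical helpers, and the leaf type `"utf8"` is an assumption because the shape above does not state leaf types.

```python
def missing_paths(source, target, path=""):
    """Return dotted paths of fields present in `target` but absent from `source`."""
    if isinstance(source, dict) and isinstance(target, dict):
        out = []
        for name, target_type in target.items():
            child = f"{path}.{name}" if path else name
            if name not in source:
                out.append(child)
            else:
                out += missing_paths(source[name], target_type, child)
        return out
    if isinstance(source, list) and isinstance(target, list):
        # A list type is modelled as [item_type]; descend into the item type.
        return missing_paths(source[0], target[0], path + "[]")
    return []


def swap(with_chain):
    # Leaf types are assumed to be "utf8" purely for illustration.
    chain = {"chain": "utf8"} if with_chain else {}
    token = {"identifier_type": "utf8", "value": "utf8"}
    return {
        "inputAsset": {
            "amount": {"type": "utf8", "value": "utf8"},
            "token": token,
            "type": "utf8",
            **chain,
        },
        "outputAsset": {"token": token, "type": "utf8", **chain},
    }


def schema(with_chain):
    swaps = [swap(with_chain)]
    tool_calls = [{"args": {"swaps": swaps}}]
    return {"data": {"messages": [{"kwargs": {"tool_calls": tool_calls}}]}}


drift = missing_paths(schema(with_chain=False), schema(with_chain=True))
# drift == [
#     "data.messages[].kwargs.tool_calls[].args.swaps[].inputAsset.chain",
#     "data.messages[].kwargs.tool_calls[].args.swaps[].outputAsset.chain",
# ]
```

Both additions sit three `List` levels deep, which is why a struct-only adaptation rule never reaches them.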
   
   ## Suggested fix direction
   
   A clean fix seems to be:
   
   1. Generalize compatibility checking from plain struct fields to recursive 
nested type compatibility.
   2. Extend `cast_column` to recursively adapt container types whose child or 
value type contains evolved structs.
   3. Use that recursive compatibility logic from the default physical 
expression adapter as well.
   
   Concretely, this likely means adding support for recursive adaptation of:
   
   - `List`
   - `LargeList`
   - `FixedSizeList`
   - `Map`
   
   instead of only `Struct`.
   
   ## Proposed semantics
   
   For nested container evolution:
   
   - matching fields should still be cast using existing cast rules
   - missing target fields should become null arrays when nullable
   - nullable source to non-nullable target should still fail
   - extra source fields should still be ignored
   - incompatible primitive type changes should still error
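
The rules above can be sketched as a type-level check that runs before any data is touched. This is a toy model rather than the actual `validate_struct_compatibility` logic: `Field`, `Struct`, `ListOf`, `MapOf`, `CASTABLE`, and `problems` are hypothetical names, and `CASTABLE` is a two-entry stand-in for the existing cast rules, not a description of what Arrow really allows.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Field:
    name: str
    dtype: Any            # a primitive name (str), Struct, ListOf, or MapOf
    nullable: bool = True


@dataclass(frozen=True)
class Struct:
    fields: tuple         # tuple of Field


@dataclass(frozen=True)
class ListOf:             # toy stand-in for List / LargeList / FixedSizeList
    item: Any


@dataclass(frozen=True)
class MapOf:              # only the value type is checked in this sketch
    value: Any


# Illustrative stand-in for the existing primitive cast rules.
CASTABLE = {("int32", "int64"), ("utf8", "large_utf8")}


def problems(source, target, path="$"):
    """Return the reasons `source` cannot be adapted to `target` (empty if none)."""
    if isinstance(source, Struct) and isinstance(target, Struct):
        by_name = {f.name: f for f in source.fields}
        found = []
        for tf in target.fields:
            child = f"{path}.{tf.name}"
            sf = by_name.get(tf.name)
            if sf is None:
                # Missing target field: acceptable only when it is nullable.
                if not tf.nullable:
                    found.append(f"{child}: missing and not nullable")
                continue
            if sf.nullable and not tf.nullable:
                found.append(f"{child}: nullable source, non-nullable target")
            found += problems(sf.dtype, tf.dtype, child)
        return found      # extra source fields are never inspected, so they are ignored
    if isinstance(source, ListOf) and isinstance(target, ListOf):
        return problems(source.item, target.item, path + "[]")
    if isinstance(source, MapOf) and isinstance(target, MapOf):
        return problems(source.value, target.value, path + "{}")
    if source == target:
        return []
    if isinstance(source, str) and isinstance(target, str) and (source, target) in CASTABLE:
        return []
    return [f"{path}: cannot cast {source} to {target}"]
```

Returning a list of reasons instead of a boolean is a design choice for the sketch: with deeply nested shapes, the path to the offending field is more useful than the full old and new type printed side by side.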
   
   ## Tests that would be useful
   
   I think the missing coverage is around:
   
   - `List<Struct>` where target adds a nullable nested field
   - `LargeList<Struct>` with the same pattern
   - `FixedSizeList<Struct>` with the same pattern
   - `Map<_, Struct>` or map entries containing evolved structs
   - recursive case like `Struct(messages: List(Struct(...)))`
   
   ## Impact
   
   This currently forces application-level workarounds such as preprocessing or 
rewriting Parquet files to the latest schema before querying, even though the 
evolution is additive and nullable.
   
   It would be much better if the default Parquet scan path handled this 
directly, the same way plain `Struct` evolution is already handled.
   

