TheBuilderJR opened a new issue, #20835:
URL: https://github.com/apache/datafusion/issues/20835
### Describe the bug
## Summary
DataFusion currently supports additive schema evolution reasonably well for
plain `Struct` columns, but it fails when the evolved struct is nested inside a
container type such as `List<Struct>`.
This shows up in Parquet scans where the logical schema is newer than some of
the physical files. If a nested struct inside a list gains a new nullable field,
DataFusion fails during planning or execution instead of adapting the older
files by filling the new field with nulls.
## Version
Observed on DataFusion `52.1.0`.
## Problem
Given:
- older parquet files with a field shaped like `List(Struct(...))`
- newer parquet files where the struct inside that list has additional
nullable fields
- a scan using the latest logical schema across both old and new files
DataFusion fails with an error like:
```text
Cannot cast struct field 'messages' from type List(Struct(...old shape...))
to type List(Struct(...new shape...))
```
In my case, the concrete drift is:
- old physical files:
  - `inputAsset: Struct(type, token, amount)`
  - `outputAsset: Struct(type, token)`
- new logical schema:
  - `inputAsset: Struct(type, token, amount, chain)`
  - `outputAsset: Struct(type, token, chain)`
where both `chain` fields are nullable additions.
## Expected behavior
For additive schema evolution, DataFusion should treat nested container
cases similarly to plain `Struct` evolution:
- missing fields in older files should be filled with nulls if the target
field is nullable
- extra fields in older or newer files should be ignored when not present in
the target
- recursive adaptation should work through:
  - `List`
  - `LargeList`
  - `FixedSizeList`
  - `Map`
  - combinations like `Struct -> List(Struct) -> Struct`
This should allow both narrow projections and `SELECT *` across
schema-drifted parquet files without application-side rewriting.
## Actual behavior
DataFusion succeeds for some plain `Struct` evolution scenarios, but fails
when the evolved struct is nested in a list or map-like container.
The failure appears during schema rewriting or cast validation for Parquet
scan expressions.
## Why this seems like a gap in the current implementation
From reading the current code:
- `DefaultPhysicalExprAdapterRewriter::rewrite_column` special-cases
`(Struct, Struct)` compatibility and otherwise falls back to generic
`can_cast_types`
- `datafusion_common::nested_struct::cast_column` special-cases target
`Struct` and otherwise falls back to generic Arrow casting
- as a result, `Struct` evolution gets custom handling, but `List<Struct>`
does not
So the current behavior looks like:
- supported: `Struct -> Struct` with missing or extra fields
- not supported: `List<Struct> -> List<Struct>` with additive nested fields
## Relevant code paths
These are the places that seem most relevant:
- `datafusion-common/src/nested_struct.rs`
  - `cast_column`
  - `validate_struct_compatibility`
- `datafusion-physical-expr-adapter/src/schema_rewriter.rs`
  - `DefaultPhysicalExprAdapterRewriter::rewrite_column`
- `datafusion-physical-expr/src/expressions/cast_column.rs`
  - `CastColumnExpr::evaluate`
## Minimal shape of the repro
Logical schema:
```text
data: Struct(
  messages: List(
    Struct(
      kwargs: Struct(
        tool_calls: List(
          Struct(
            args: Struct(
              swaps: List(
                Struct(
                  inputAsset: Struct(
                    amount: Struct(type, value),
                    token: Struct(identifier_type, value),
                    type,
                    chain
                  ),
                  outputAsset: Struct(
                    token: Struct(identifier_type, value),
                    type,
                    chain
                  )
                )
              )
            )
          )
        )
      )
    )
  )
)
```
Older physical files have the same shape except `inputAsset.chain` and
`outputAsset.chain` are absent.
## Suggested fix direction
A clean fix seems to be:
1. Generalize compatibility checking from plain struct fields to recursive
nested type compatibility.
2. Extend `cast_column` to recursively adapt container types whose child or
value type contains evolved structs.
3. Use that recursive compatibility logic from the default physical
expression adapter as well.
Concretely, this likely means adding support for recursive adaptation of:
- `List`
- `LargeList`
- `FixedSizeList`
- `Map`
instead of only `Struct`.
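To make the suggested direction concrete, here is a minimal sketch of what a recursive compatibility check could look like. It is a toy model using only the standard library: `Ty`, `Field`, `field`, `check_compatible`, and `check_field` are hypothetical names invented for illustration and are not Arrow's or DataFusion's actual types or functions. Leaf types are compared by simple equality here, whereas a real implementation would presumably defer to the existing cast rules.

```rust
// Toy stand-ins for Arrow's DataType/Field (hypothetical, illustration only).
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int64,
    Utf8,
    Struct(Vec<Field>),
    List(Box<Field>),
    LargeList(Box<Field>),
    FixedSizeList(Box<Field>, usize),
    Map { key: Box<Field>, value: Box<Field> },
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    ty: Ty,
    nullable: bool,
}

// Small constructor to keep schema definitions readable.
fn field(name: &str, ty: Ty, nullable: bool) -> Field {
    Field { name: name.to_string(), ty, nullable }
}

/// Checks whether `source` can be adapted to `target` under additive
/// evolution, recursing through struct, list, and map containers.
fn check_compatible(source: &Ty, target: &Ty) -> Result<(), String> {
    match (source, target) {
        (Ty::Struct(src), Ty::Struct(tgt)) => {
            for t in tgt {
                match src.iter().find(|s| s.name == t.name) {
                    // Field exists on both sides: recurse into it.
                    Some(s) => check_field(s, t)?,
                    // Field missing in the source: fine only if nullable.
                    None if t.nullable => {}
                    None => {
                        return Err(format!("missing non-nullable field '{}'", t.name))
                    }
                }
            }
            // Source fields absent from the target are simply ignored.
            Ok(())
        }
        (Ty::List(s), Ty::List(t)) | (Ty::LargeList(s), Ty::LargeList(t)) => {
            check_field(s, t)
        }
        (Ty::FixedSizeList(s, n), Ty::FixedSizeList(t, m)) if n == m => check_field(s, t),
        (Ty::Map { key: sk, value: sv }, Ty::Map { key: tk, value: tv }) => {
            check_field(sk, tk)?;
            check_field(sv, tv)
        }
        // Leaf types: this sketch only accepts identical types.
        (s, t) if s == t => Ok(()),
        (s, t) => Err(format!("cannot adapt {:?} to {:?}", s, t)),
    }
}

/// Field-level check: nullability first, then the recursive type check.
fn check_field(source: &Field, target: &Field) -> Result<(), String> {
    if source.nullable && !target.nullable {
        return Err(format!(
            "field '{}' is nullable in source but not in target",
            target.name
        ));
    }
    check_compatible(&source.ty, &target.ty)
}
```

The point of the sketch is that the struct rule stays unchanged and the container arms only forward to their child field, so `List(Struct)` and deeper combinations fall out of the recursion rather than needing separate special cases.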
## Proposed semantics
For nested container evolution:
- matching fields should still be cast using existing cast rules
- target fields missing from the source should become null arrays when the
target field is nullable
- nullable source to non-nullable target should still fail
- extra source fields should still be ignored
- incompatible primitive type changes should still error
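
The semantics listed above can be illustrated on row-like values rather than Arrow arrays. The following is only a sketch under that simplification: `Value`, `Ty`, `Field`, and `adapt` are hypothetical names made up for this example, and a real implementation would operate on columnar arrays (building null arrays and reusing list offsets) rather than on individual values.

```rust
// Toy value and type model (hypothetical, not Arrow's array types).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
    Text(String),
    Struct(Vec<(String, Value)>),
    List(Vec<Value>),
}

#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Int,
    Text,
    Struct(Vec<Field>),
    List(Box<Ty>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    ty: Ty,
    nullable: bool,
}

/// Reshapes `value` to match `target`, applying the proposed rules:
/// matching fields recurse, missing nullable fields become Null,
/// extra source fields are dropped, and incompatible leaves are errors.
fn adapt(value: &Value, target: &Ty) -> Result<Value, String> {
    match (value, target) {
        // Nullability is enforced by the enclosing struct field below.
        (Value::Null, _) => Ok(Value::Null),
        (Value::Int(_), Ty::Int) | (Value::Text(_), Ty::Text) => Ok(value.clone()),
        (Value::List(items), Ty::List(elem)) => {
            // Adapt every element against the target element type.
            let adapted: Result<Vec<Value>, String> =
                items.iter().map(|v| adapt(v, elem)).collect();
            Ok(Value::List(adapted?))
        }
        (Value::Struct(src), Ty::Struct(fields)) => {
            let mut out = Vec::with_capacity(fields.len());
            // Iterate target fields only, so extra source fields are ignored.
            for f in fields {
                let v = match src.iter().find(|(n, _)| *n == f.name) {
                    Some((_, v)) => adapt(v, &f.ty)?,
                    None => Value::Null,
                };
                // A null (missing or present) in a non-nullable target fails.
                if v == Value::Null && !f.nullable {
                    return Err(format!("null for non-nullable field '{}'", f.name));
                }
                out.push((f.name.clone(), v));
            }
            Ok(Value::Struct(out))
        }
        (v, t) => Err(format!("cannot adapt {:?} to {:?}", v, t)),
    }
}
```

For example, adapting a list of structs that lack `chain` to a target whose struct declares a nullable `chain` yields the same list with `chain` set to `Value::Null` in every element, which mirrors the null filling that plain `Struct` evolution already gets.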
## Tests that would be useful
I think the missing coverage is around:
- `List<Struct>` where target adds a nullable nested field
- `LargeList<Struct>` with the same pattern
- `FixedSizeList<Struct>` with the same pattern
- `Map<_, Struct>` or map entries containing evolved structs
- recursive case like `Struct(messages: List(Struct(...)))`
## Impact
This currently forces application-level workarounds such as preprocessing or
rewriting parquet files to the latest schema before querying, even though the
evolution is additive and nullable.
It would be much better if the default Parquet scan path handled this
directly, the same way plain `Struct` evolution is already handled.