kosiew opened a new issue, #10233:
URL: https://github.com/apache/arrow-rs/issues/10233

   ## Problem
   
   Arrow nested arrays can contain child values that are semantically hidden by 
a null parent slot. During casts, those hidden child values may still be 
inspected and can cause the cast to fail, even though the corresponding parent 
value is null and the child values are not logically visible.
   
   [This was 
observed](https://github.com/apache/datafusion/pull/22980#discussion_r3486049771)
 while adding DataFusion support for recursive schema adaptation of 
`FixedSizeList<Struct>` values. `FixedSizeListArray` always stores `len * 
list_size` child slots, including slots for null parent lists. If those hidden 
child slots contain values that fail the child cast, the nested cast fails 
unless the caller first masks the hidden child positions to null.
   
   ## Why it matters
   
   For nested arrays, parent nulls should hide child contents from value-level 
cast failures. Otherwise, a valid array can fail a cast because of child values 
that no reader can ever observe. This can force downstream projects to wrap 
Arrow casts in local masking workarounds.
   
   ## Invariant / desired behavior
   
   When casting a nested array with nullable parent slots, child values under 
null parents should not cause value-level cast failures. The cast should 
preserve the parent nulls and only require visible child values to cast 
successfully.
   
   Example shape:
   
   ```text
   FixedSizeList<Struct<a: Utf8>> list_size = 2
   parent validity: [null, valid]
   child a values: ["not_int", "also_bad", "1", "2"]
   cast target: FixedSizeList<Struct<a: Int32>>
   
   Expected:
   - parent[0] remains null
   - child slots 0 and 1 are ignored/masked because parent[0] is null
   - parent[1] casts from ["1", "2"] to [1, 2]
   
   Unexpected:
   - cast fails on "not_int" / "also_bad" even though parent[0] is null
   ```
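
The example shape above can be modelled without the Arrow crate. The following is a minimal std-only sketch, not arrow-rs code: the function `cast_children` and its `mask_hidden` flag are hypothetical names, and the `Struct<a: ...>` layer is dropped so that the child is a flat column of strings. It only illustrates the difference between parsing every child slot and skipping slots whose parent is null.

```rust
/// Plain-Rust model of a FixedSizeList<Utf8> -> FixedSizeList<Int32> cast.
/// `parent_valid[i]` is the validity of list `i`; `children` holds
/// `parent_valid.len() * list_size` slots, including slots under null parents.
/// (Hypothetical helper for illustration; not part of arrow-rs.)
fn cast_children(
    parent_valid: &[bool],
    list_size: usize,
    children: &[&str],
    mask_hidden: bool,
) -> Result<Vec<Option<i32>>, String> {
    assert_eq!(children.len(), parent_valid.len() * list_size);
    children
        .iter()
        .enumerate()
        .map(|(slot, text)| {
            // Child slot `slot` belongs to parent `slot / list_size`.
            let visible = parent_valid[slot / list_size];
            if mask_hidden && !visible {
                // Hidden slot: emit null without inspecting the value.
                return Ok(None);
            }
            text.parse::<i32>()
                .map(Some)
                .map_err(|e| format!("slot {slot}: cannot cast {text:?} to Int32: {e}"))
        })
        .collect()
}
```

With `parent_valid = [false, true]`, `list_size = 2`, and the child values from the example, calling the function with `mask_hidden = false` reproduces the unexpected failure on `"not_int"`, while `mask_hidden = true` yields null, null, 1, 2.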
   
   ## Proposed direction
   
   Evaluate whether Arrow cast kernels for nested arrays should be 
parent-null-aware:
   
   - For `FixedSizeList`, expand the parent null bitmap to child positions 
before recursively casting child values, or otherwise ensure hidden child 
values cannot fail the cast.
   - Consider whether the same invariant should apply to `List`, `LargeList`, 
`ListView`, and `LargeListView` when null parent slots reference non-empty 
child ranges.
   - Preserve existing behavior for visible child values and for schema/type 
incompatibilities.
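
As a rough illustration of the first two bullets, the sketch below expands a parent validity mask to child positions for a fixed-size layout and for an offset-based layout. It is a std-only model under stated assumptions: validity is represented as `&[bool]` rather than a packed bitmap, offsets are `usize`, and the function names `expand_fixed` and `expand_offsets` are hypothetical. View-style layouts (`ListView`, `LargeListView`), which carry separate sizes and may have overlapping ranges, are not modelled here.

```rust
/// FixedSizeList: every parent owns exactly `list_size` consecutive child
/// slots, so the child mask repeats each parent bit `list_size` times.
fn expand_fixed(parent_valid: &[bool], list_size: usize) -> Vec<bool> {
    parent_valid
        .iter()
        .flat_map(|&v| std::iter::repeat(v).take(list_size))
        .collect()
}

/// List / LargeList: parent `i` owns child slots `offsets[i]..offsets[i + 1]`.
/// A child slot counts as visible only if a valid parent's range covers it.
fn expand_offsets(parent_valid: &[bool], offsets: &[usize], child_len: usize) -> Vec<bool> {
    assert_eq!(offsets.len(), parent_valid.len() + 1);
    let mut visible = vec![false; child_len];
    for (i, &valid) in parent_valid.iter().enumerate() {
        if valid {
            visible[offsets[i]..offsets[i + 1]].fill(true);
        }
    }
    visible
}
```

For the offset-based case, a null parent whose range is `0..2` leaves child slots 0 and 1 marked as hidden, which is the "null parent slots reference non-empty child ranges" situation mentioned above. A child cast could then consult this mask before reporting a value-level error.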
   
   ## Scope
   
   ### In
   
   - Reproduce hidden-child cast failure for `FixedSizeList` with null parent 
slots.
   - Decide expected Arrow semantics for value-level cast errors under null 
nested parents.
   - Add regression coverage for at least `FixedSizeList`.
   - If accepted, update nested cast implementation so hidden child values 
under null parents do not fail casts.
   
   ### Out
   
   - DataFusion-specific schema evolution rules.
   - Struct field-addition compatibility policy.
   - Planner/runtime parity checks in DataFusion.
   - Changing behavior for visible child values that fail to cast.
   
   ## Acceptance criteria
   
   - [ ] A `FixedSizeList` cast with invalid child values under null parent 
slots succeeds when all visible child values are castable.
   - [ ] Parent null bitmap is preserved in the cast result.
   - [ ] Visible invalid child values still fail or are converted to null 
according to the configured cast options.
   - [ ] Type/schema incompatibilities are still rejected.
   - [ ] Tests clarify whether the same behavior is expected for `List` / 
`ListView` families.
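
The interaction between the first three criteria can be stated per child slot. The sketch below is a hypothetical std-only model, not the arrow-rs implementation: `cast_slot` is an invented name, and its `safe` parameter is assumed to mirror the meaning of `CastOptions::safe` in arrow-rs (true converts a failed value to null, false returns an error).

```rust
/// Per-slot model of the acceptance criteria (illustrative only).
/// `parent_valid` is the validity of the list that owns this child slot.
fn cast_slot(text: &str, parent_valid: bool, safe: bool) -> Result<Option<i32>, String> {
    if !parent_valid {
        // Hidden under a null parent: never inspected, regardless of `safe`.
        return Ok(None);
    }
    match text.parse::<i32>() {
        Ok(v) => Ok(Some(v)),
        // Visible and invalid: null in safe mode, error otherwise.
        Err(_) if safe => Ok(None),
        Err(e) => Err(format!("cannot cast {text:?} to Int32: {e}")),
    }
}
```

Under this model only a visible, uncastable value in strict mode produces an error. Type or schema incompatibilities are a separate check that happens before any value is examined, so they are outside the sketch.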
   
   ## Tests / verification
   
   Suggested regression test:
   
   - Build `FixedSizeList<Struct<a: Utf8>>` with `list_size = 2`.
   - Parent validity: first list null, second list valid.
   - Child `a` values: `"not_int"`, `"also_bad"`, `"1"`, `"2"`.
   - Cast to `FixedSizeList<Struct<a: Int32>>`.
   - Assert first parent remains null, second parent is valid, and visible 
values cast to `1`, `2`.
   
   

