kosiew opened a new issue, #22986:
URL: https://github.com/apache/datafusion/issues/22986

   ## Summary
   
   `datafusion/functions-nested/src/map.rs` has parallel code paths for 
decomposing list-like key/value inputs when building `MapArray` values. The 
`List`/`LargeList` path and the `FixedSizeList` path both perform the same 
conceptual steps: preserve input row count and null-map bitmap, decompose each 
non-null row into key/value arrays, skip value rows corresponding to null key 
maps, then call the shared `build_map_array` helper.
   
   This duplication is a maintainability risk. A small shared abstraction for 
“rows of list-like arrays” would make map construction easier to reason about 
and reduce the chance that future fixes to null-map handling, offset 
accounting, or length validation are applied to one list representation but 
missed in another.
   
   ## Current code involved
   
   Relevant functions in `datafusion/functions-nested/src/map.rs` as [at 
commit](https://github.com/apache/datafusion/tree/96a6096c6f4b924e8cab4bc1629759a948e12939):
   
   - `make_map_array_internal<O: OffsetSizeTrait>` for `List` and `LargeList`
   - `make_map_array_from_fixed_size_list` for `FixedSizeList`
   - `list_to_arrays_skipping_null_rows<O: OffsetSizeTrait>`
   - `fixed_size_list_to_arrays_skipping_null_rows`
   - `build_map_array`
   
   ## Problem
   
   The `List`/`LargeList` and `FixedSizeList` construction paths have nearly 
identical control flow:
   
   1. Save the original key/value data types.
   2. Save the original input row count.
   3. Capture the key-array null bitmap, which represents null map rows.
   4. Decompose key rows into `Vec<ArrayRef>`.
   5. Decompose value rows while skipping rows where the corresponding key map 
is null.
   6. Delegate to `build_map_array`.
   
   The row-skipping helpers are also duplicated, differing mainly in how they 
iterate list rows (`as_list::<O>()` vs `as_fixed_size_list()`).
   
   Because these paths encode the same invariants separately, future changes 
can drift. Areas at risk include:
   
   - preserving null map rows rather than treating them as null keys
   - keeping map offsets aligned with original input rows
   - handling all-null maps with correct element data types
   - validating matching key/value row counts consistently
   - maintaining consistent behavior between `List`, `LargeList`, and 
`FixedSizeList`
   
   ## Proposed direction
   
   Introduce a small shared abstraction for list-like row decomposition used by 
map construction.
   
   Possible shapes:
   
   ### Option A: helper taking row iterators
   
   Create one helper that accepts already-decomposed key rows and a value-row 
iterator/filter strategy, then calls `build_map_array`.
   
   ### Option B: enum-backed adapter
   
   Add an internal enum such as `ListLikeArray<'a>` with variants for:
   
   - `List32(&'a ArrayRef)`
   - `List64(&'a ArrayRef)`
   - `FixedSizeList(&'a ArrayRef)`
   
   Expose methods like:
   
   - `data_type()`
   - `len()`
   - `nulls()`
   - `rows()` / `non_null_rows_skipping(nulls)`
   
   ### Option C: small trait/internal helper functions
   
   Use a private trait or generic helper to hide the row iteration difference 
while keeping the current dispatch from `DataType`.
   
   Any option should keep the public API unchanged and avoid broad 
architectural changes.
   
   ## Suggested implementation constraints
   
   - Keep this refactor local to `datafusion/functions-nested/src/map.rs` 
unless a clearly reusable utility already belongs elsewhere.
   - Preserve current error messages where practical.
   - Preserve `List`, `LargeList`, and `FixedSizeList` behavior exactly.
   - Do not change SQL-visible semantics.
   - Keep `build_map_array` as the single place that constructs offsets, 
flattened arrays, struct data, and the final `MapArray`, unless the refactor 
makes an equally clear replacement.
   
   ## Test coverage to preserve/add
   
   Existing unit tests around map construction should continue to pass, 
especially coverage for:
   
   - null map rows
   - null key rejection within a map
   - `LargeList` input
   - `FixedSizeList` input
   - mixed scalar/array map arguments
   
   Useful additional targeted tests if gaps remain:
   
   1. `List` and `FixedSizeList` inputs with the same null-row pattern produce 
equivalent map nulls and offsets.
   2. All map rows null still produce an empty child array with the expected 
key/value element types.
   3. Mismatched key/value row counts fail consistently across `List`, 
`LargeList`, and `FixedSizeList`.
   
   ## Acceptance criteria
   
   - The duplicate row-skipping logic for `List`/`LargeList` and 
`FixedSizeList` is unified or reduced to a thin adapter layer.
   - Map construction invariants are documented in one place near the shared 
helper.
   - Existing map unit tests pass.
   - Any newly added tests pass with `cargo test -p datafusion-functions-nested 
--lib map::tests`.
   - No public API or SQL behavior changes.
   
   ## Scope
   
   This is a refactor only. It is not intended to add new map semantics or 
change validation behavior.
   
   ## Related PR
   #22784
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to