[PR] fix: array expression audit follow-ups (#4503) [datafusion-comet]

via GitHub Tue, 23 Jun 2026 11:45:45 -0700


andygrove opened a new pull request, #4713:
URL: https://github.com/apache/datafusion-comet/pull/4713


   ## Which issue does this PR close?
   
   Closes #4503.
   
   This PR addresses the actionable follow-up items from the array expression 
audit (#4503). Item 5 (`array_union` ordering) is tracked separately in #4681 
and is not included here.
   
   ## Rationale for this change
   
   The audit recorded several support-level / serde-consistency gaps where a 
restriction was enforced in `convert` (so it never surfaced in EXPLAIN or the 
auto-generated compatibility doc), a support level disagreed with runtime 
behaviour, or a serde registration was dead after Spark 4.0 rewrites. Lifting 
these decisions into `getSupportLevel` lets EXPLAIN, the compatibility guide, 
and the codegen dispatcher all see the same truth.
   
   While working through the high-priority items, one turned out to be obsolete:
   
   - **Item 2** asked to flip `array_contains` / `array_distinct` / 
`array_union` / `array_max` / `array_min` to `Incompatible` for float/double 
elements because of NaN / signed-zero canonicalization (#4481, #4482). The 
existing SQL test files (`array_distinct.sql`, `array_max.sql`, 
`array_min.sql`, `array_union.sql`, `array_contains.sql`) already exercise 
`array(NaN, NaN)` de-duplication, `array(0.0, -0.0)` signed-zero, and 
`Infinity` cases in default `query` mode (native execution + exact Spark 
match), and they pass on the current serdes. Native DataFusion already 
canonicalizes NaN and signed zero to match Spark, so reporting `Incompatible` 
would force a needless fallback to the JVM codegen dispatcher with no 
correctness benefit and a misleading EXPLAIN label. Item 2 is therefore omitted 
as obsolete.
   
   ## What changes are included in this PR?
   
   - **ArrayReverse (item 1):** `getSupportLevel` now reports `Incompatible` 
for element types the native `array_reverse` cannot handle (binary, struct, 
map) so they route through the JVM codegen dispatcher (via `CometReverse`, 
which mixes in `CodegenDispatchFallback`) and stay native. Previously 
`StructType` reported `Compatible` while `convert` declined it, so those arrays 
silently fell back to Spark.
   - **SortArray (item 4):** Accept any foldable boolean `ascendingOrder`, not 
just a boolean `Literal`. Spark 4.0+ widens `ascendingOrder` to any foldable 
boolean; `convert` now evaluates the foldable expression (a null result unboxes 
to `false`, matching Spark's `right.eval().asInstanceOf[Boolean]`). Spark 3.x 
still only ever passes a `Literal`, so its behaviour is unchanged.
   - **ArrayJoin (item 7):** Fall back for non-default string collations, 
mirroring `CometArrayIntersect`, so the limitation is visible in EXPLAIN. The 
native `array_to_string` produces UTF8_BINARY semantics and does not propagate 
collations (#2190).
   - **ArrayExcept (items 1 & 6):** Surface the native element-type restriction 
(binary / struct) in `getSupportLevel`. It stays `Incompatible` (not 
`Unsupported`) so the codegen dispatcher still evaluates those types natively 
under the default config; the `convert`-time guard remains as a defensive net 
for the `allowIncompatible=true` path.
   - **Dead registrations (item 3):** Remove the `ArrayCompact` serde 
registration. `ArrayCompact` is `RuntimeReplaceable` in all supported Spark 
versions (rewritten to `ArrayFilter(arr, IsNotNull(...))`), so it never reaches 
serde directly; dispatch flows through `CometArrayFilter -> CometArrayCompact`. 
`ArrayAppend` is documented as reachable only on Spark 3.x (Spark 4.0+ rewrites 
it to `ArrayInsert`).
   
   ## How are these changes tested?
   
   New SQL file tests under 
`spark/src/test/resources/sql-tests/expressions/array/`:
   
   - `array_reverse.sql` covers int / string / struct / nested-struct arrays. 
The struct cases assert the dispatcher keeps execution native (runs on all 
supported Spark versions).
   - `sort_array_foldable.sql` (`MinSparkVersion: 4.0`) exercises foldable 
non-literal `ascendingOrder` (`cast(1 as boolean)` / `cast(0 as boolean)`, 
which survive the suite's disabled `ConstantFolding` as `Cast` nodes).
   - `array_join_collation.sql` (`MinSparkVersion: 4.0`) asserts collated 
inputs fall back with the expected reason.
   
   Verified on Spark 3.5 and 4.0:
   - New SQL files pass on both profiles.
   - Existing `CometArrayExpressionSuite` (43 tests) and the existing 
`array_join`, `sort_array`, `array_except`, `array_except_dispatch`, 
`array_compact` SQL tests pass with no regressions.
   - `scalafix` + `spotless` clean on both profiles.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] fix: array expression audit follow-ups (#4503) [datafusion-comet]

Reply via email to