andygrove opened a new pull request, #4713: URL: https://github.com/apache/datafusion-comet/pull/4713
## Which issue does this PR close? Closes #4503. This PR addresses the actionable follow-up items from the array expression audit (#4503). Item 5 (`array_union` ordering) is tracked separately in #4681 and is not included here. ## Rationale for this change The audit recorded several support-level / serde-consistency gaps where a restriction was enforced in `convert` (so it never surfaced in EXPLAIN or the auto-generated compatibility doc), a support level disagreed with runtime behaviour, or a serde registration was dead after Spark 4.0 rewrites. Lifting these decisions into `getSupportLevel` lets EXPLAIN, the compatibility guide, and the codegen dispatcher all see the same truth. While working through the high-priority items, one turned out to be obsolete: - **Item 2** asked to flip `array_contains` / `array_distinct` / `array_union` / `array_max` / `array_min` to `Incompatible` for float/double elements because of NaN / signed-zero canonicalization (#4481, #4482). The existing SQL test files (`array_distinct.sql`, `array_max.sql`, `array_min.sql`, `array_union.sql`, `array_contains.sql`) already exercise `array(NaN, NaN)` de-duplication, `array(0.0, -0.0)` signed-zero, and `Infinity` cases in default `query` mode (native execution + exact Spark match), and they pass on the current serdes. Native DataFusion already canonicalizes NaN and signed zero to match Spark, so reporting `Incompatible` would force a needless fallback to the JVM codegen dispatcher with no correctness benefit and a misleading EXPLAIN label. Item 2 is therefore omitted as obsolete. ## What changes are included in this PR? - **ArrayReverse (item 1):** `getSupportLevel` now reports `Incompatible` for element types the native `array_reverse` cannot handle (binary, struct, map) so they route through the JVM codegen dispatcher (via `CometReverse`, which mixes in `CodegenDispatchFallback`) and stay native. Previously `StructType` reported `Compatible` while `convert` declined it, so those arrays silently fell back to Spark. - **SortArray (item 4):** Accept any foldable boolean `ascendingOrder`, not just a boolean `Literal`. Spark 4.0+ widens `ascendingOrder` to any foldable boolean; `convert` now evaluates the foldable expression (a null result unboxes to `false`, matching Spark's `right.eval().asInstanceOf[Boolean]`). Spark 3.x still only ever passes a `Literal`, so its behaviour is unchanged. - **ArrayJoin (item 7):** Fall back for non-default string collations, mirroring `CometArrayIntersect`, so the limitation is visible in EXPLAIN. The native `array_to_string` produces UTF8_BINARY semantics and does not propagate collations (#2190). - **ArrayExcept (items 1 & 6):** Surface the native element-type restriction (binary / struct) in `getSupportLevel`. It stays `Incompatible` (not `Unsupported`) so the codegen dispatcher still evaluates those types natively under the default config; the `convert`-time guard remains as a defensive net for the `allowIncompatible=true` path. - **Dead registrations (item 3):** Remove the `ArrayCompact` serde registration. `ArrayCompact` is `RuntimeReplaceable` in all supported Spark versions (rewritten to `ArrayFilter(arr, IsNotNull(...))`), so it never reaches serde directly; dispatch flows through `CometArrayFilter -> CometArrayCompact`. `ArrayAppend` is documented as reachable only on Spark 3.x (Spark 4.0+ rewrites it to `ArrayInsert`). ## How are these changes tested? New SQL file tests under `spark/src/test/resources/sql-tests/expressions/array/`: - `array_reverse.sql` covers int / string / struct / nested-struct arrays. The struct cases assert the dispatcher keeps execution native (runs on all supported Spark versions). - `sort_array_foldable.sql` (`MinSparkVersion: 4.0`) exercises foldable non-literal `ascendingOrder` (`cast(1 as boolean)` / `cast(0 as boolean)`, which survive the suite's disabled `ConstantFolding` as `Cast` nodes). - `array_join_collation.sql` (`MinSparkVersion: 4.0`) asserts collated inputs fall back with the expected reason. Verified on Spark 3.5 and 4.0: - New SQL files pass on both profiles. - Existing `CometArrayExpressionSuite` (43 tests) and the existing `array_join`, `sort_array`, `array_except`, `array_except_dispatch`, `array_compact` SQL tests pass with no regressions. - `scalafix` + `spotless` clean on both profiles. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
