andygrove opened a new issue, #4503: URL: https://github.com/apache/datafusion-comet/issues/4503
Tracking issue for follow-up work surfaced by the array expression audit in #4483. Each item below is either a support-level / serde correctness fix that the audit deliberately deferred, or a coverage gap the audit documented but did not implement. Already-filed gaps (#4481 NaN/signed-zero canonicalization, #4482 array_max/array_min NaN ordering, #3178 array_join null handling, #2190 collation, #3338 empty-array coerce_types) are not duplicated here. ## High priority ### 1. Lift `convert`-time `withInfo` fallbacks in `arrays.scala` into `getSupportLevel` Several array serdes bail out from `convert` with `withInfo(...) + return None` for cases that are knowable from the expression alone: - `CometArrayRemove` (`spark/src/main/scala/org/apache/comet/serde/arrays.scala:40-55`) rejects unsupported child types in `convert`. - `CometArrayExcept` (`arrays.scala:357-374`) rejects `BinaryType` / `StructType` element types in `convert` rather than via `getSupportLevel`. - `CometArrayPosition` (`arrays.scala:701-727`) rejects all-foldable args and unsupported element types in `convert`. - `CometElementAt` (`arrays.scala:565-598`) rejects non-array input in `convert`; the `MapType` rejection in particular belongs in `getSupportLevel`. - `CometFlatten` (`arrays.scala:603-617`) rejects unsupported child element types in `convert`. - `CometArrayReverse` (`arrays.scala:545-556`) rejects unsupported child types in `convert` even though it already has a `getSupportLevel`. - `CometSortArray` (`arrays.scala:172-182`) rejects non-`Literal` `ascendingOrder` in `convert`. Spark 4.0+ widens `ascendingOrder` to any foldable boolean, so this convert-time rejection is the new fallback path; it should be declared as `Unsupported(Some(...))` in `getSupportLevel` so EXPLAIN surfaces the reason. The `audit-comet-expression` skill (rule 10) requires expression-shape restrictions to be declared as `Unsupported(Some(reason))` / `Incompatible(Some(reason))` branches in `getSupportLevel` so EXPLAIN surfaces the reason at planning time, the auto-generated compatibility doc picks them up, and the dispatcher can route around them. The audit doc for `array_position`, `array_remove`, `element_at`, `flatten`, and `sort_array` explicitly notes these need to be lifted. ### 2. Documented NaN / canonicalization divergences should flip support level to `Incompatible` The audit doc flags `Known limitation` / `Known divergence` sub-bullets on several entries, but the corresponding Comet serdes still report `Compatible` (skill rule 12 requires this to flip the support level): - `CometArrayContains` (`arrays.scala:109-120`): registered with no `getSupportLevel` override (defaults to `Compatible`); audit doc flags the NaN-canonicalization gap (#4481). - `array_distinct` / `array_union` / `array_max` / `array_min`: registered as bare `CometScalarFunction(...)` in `QueryPlanSerde.scala:55-83`, so they default to `Compatible`. Audit doc flags NaN / signed-zero divergences (#4481, #4482). These should either gain a `CometExpressionSerde` wrapper with `Incompatible(Some(reason))` keyed on float/double element types, or the `CometScalarFunction` constructor should accept an optional support-level callback. ### 3. Dead serde registrations after Spark 4.0 rewrites Skill rule 13 says unreachable serde mappings should be relabeled or removed so the dispatcher map reflects reality: - `CometArrayAppend` (`arrays.scala:58-107`, registered at `QueryPlanSerde.scala:52`): unreachable on Spark 4.0+ because `ArrayAppend` is now `RuntimeReplaceable` and rewritten to `ArrayInsert(arr, Literal(-1), elem)`. The audit doc confirms dispatch flows through `CometArrayInsert` on 4.0+. Either drop the 4.0+ registration through the shim or mark it as 3.x-only. - `CometArrayCompact` (`arrays.scala:311-333`, registered at `QueryPlanSerde.scala:53`): `ArrayCompact` is always `RuntimeReplaceable` (rewritten to `ArrayFilter(arr, IsNotNull(...))` in all supported Spark versions), so the direct `classOf[ArrayCompact] -> CometArrayCompact` mapping is never hit at convert time. The actual dispatch path is `CometArrayFilter.convert -> CometArrayCompact.convert` (`arrays.scala:625-637`). The direct registration should be removed. ## Medium priority ### 4. `CometSortArray` Spark 4.0+ foldable-boolean coverage Spark 4.0 widens `SortArray.ascendingOrder` from a `Literal(_: Boolean, BooleanType)` to any foldable boolean. `CometSortArray` (`arrays.scala:172-182`) still pattern-matches only `Literal`, so `sort_array(arr, NOT some_const)` and similar shapes silently fall back on 4.0+. Constant-folding catches most cases but not all. Either evaluate the foldable expression at convert time (preferred) or surface the limitation in `getSupportLevel`. ### 5. `array_union` result ordering vs DataFusion is unverified The audit doc for `array_union` notes "Result ordering versus DataFusion is also unverified; compare the `array_intersect` ordering caveat." `CometArrayUnion` (`arrays.scala:454-466`) is currently `Compatible`. Verify whether DataFusion's `array_union` preserves Spark's left-first-then-new-right-elements ordering. If it does not, raise the support level to `Incompatible` with a documented reason, matching the `array_intersect` treatment. ### 6. `CometArrayExcept` data-type guard should be in `getSupportLevel` `CometArrayExcept.isTypeSupported` (`arrays.scala:342-355`) explicitly rejects `BinaryType` and `StructType` element types, but the rejection happens in `convert` (`arrays.scala:361-367`). Move the check into a `getSupportLevel` branch so EXPLAIN can report it and the compatibility doc reflects the gap. (Same pattern as item 1.) ### 7. `CometArrayJoin` Spark 4.0+ collation coverage Audit doc notes Spark 4.0 widens `ArrayJoin.inputTypes` to `AbstractArrayType(StringTypeWithCollation(supportsTrimCollation = true))` and that non-binary collations are not propagated. Already tracked under #2190 (umbrella) but worth wiring an explicit `hasNonDefaultStringCollation` guard in `CometArrayJoin.getSupportLevel` so the fallback is visible in EXPLAIN, matching the pattern `CometArrayIntersect` already follows (`arrays.scala:207-213`). --- Surfaced by the `audit-comet-expression` skill run in #4483. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
