andygrove opened a new issue, #4503:
URL: https://github.com/apache/datafusion-comet/issues/4503

   Tracking issue for follow-up work surfaced by the array expression audit in 
#4483. Each item below is either a support-level / serde correctness fix that 
the audit deliberately deferred, or a coverage gap the audit documented but did 
not implement. Already-filed gaps (#4481 NaN/signed-zero canonicalization, 
#4482 array_max/array_min NaN ordering, #3178 array_join null handling, #2190 
collation, #3338 empty-array coerce_types) are not duplicated here.
   
   ## High priority
   
   ### 1. Lift `convert`-time `withInfo` fallbacks in `arrays.scala` into 
`getSupportLevel`
   
   Several array serdes bail out from `convert` with `withInfo(...) + return 
None` for cases that are knowable from the expression alone:
   
   - `CometArrayRemove` 
(`spark/src/main/scala/org/apache/comet/serde/arrays.scala:40-55`) rejects 
unsupported child types in `convert`.
   - `CometArrayExcept` (`arrays.scala:357-374`) rejects `BinaryType` / 
`StructType` element types in `convert` rather than via `getSupportLevel`.
   - `CometArrayPosition` (`arrays.scala:701-727`) rejects all-foldable args 
and unsupported element types in `convert`.
   - `CometElementAt` (`arrays.scala:565-598`) rejects non-array input in 
`convert`; the `MapType` rejection in particular belongs in `getSupportLevel`.
   - `CometFlatten` (`arrays.scala:603-617`) rejects unsupported child element 
types in `convert`.
   - `CometArrayReverse` (`arrays.scala:545-556`) rejects unsupported child 
types in `convert` even though it already has a `getSupportLevel`.
   - `CometSortArray` (`arrays.scala:172-182`) rejects non-`Literal` 
`ascendingOrder` in `convert`. Spark 4.0+ widens `ascendingOrder` to any 
foldable boolean, so this convert-time rejection is the new fallback path; it 
should be declared as `Unsupported(Some(...))` in `getSupportLevel` so EXPLAIN 
surfaces the reason.
   
   The `audit-comet-expression` skill (rule 10) requires expression-shape 
restrictions to be declared as `Unsupported(Some(reason))` / 
`Incompatible(Some(reason))` branches in `getSupportLevel` so EXPLAIN surfaces 
the reason at planning time, the auto-generated compatibility doc picks them 
up, and the dispatcher can route around them. The audit doc for 
`array_position`, `array_remove`, `element_at`, `flatten`, and `sort_array` 
explicitly notes these need to be lifted.
   
   ### 2. Documented NaN / canonicalization divergences should flip support 
level to `Incompatible`
   
   The audit doc flags `Known limitation` / `Known divergence` sub-bullets on 
several entries, but the corresponding Comet serdes still report `Compatible` 
(skill rule 12 requires this to flip the support level):
   
   - `CometArrayContains` (`arrays.scala:109-120`): registered with no 
`getSupportLevel` override (defaults to `Compatible`); audit doc flags the 
NaN-canonicalization gap (#4481).
   - `array_distinct` / `array_union` / `array_max` / `array_min`: registered 
as bare `CometScalarFunction(...)` in `QueryPlanSerde.scala:55-83`, so they 
default to `Compatible`. Audit doc flags NaN / signed-zero divergences (#4481, 
#4482).
   
   These should either gain a `CometExpressionSerde` wrapper with 
`Incompatible(Some(reason))` keyed on float/double element types, or the 
`CometScalarFunction` constructor should accept an optional support-level 
callback.
   
   ### 3. Dead serde registrations after Spark 4.0 rewrites
   
   Skill rule 13 says unreachable serde mappings should be relabeled or removed 
so the dispatcher map reflects reality:
   
   - `CometArrayAppend` (`arrays.scala:58-107`, registered at 
`QueryPlanSerde.scala:52`): unreachable on Spark 4.0+ because `ArrayAppend` is 
now `RuntimeReplaceable` and rewritten to `ArrayInsert(arr, Literal(-1), 
elem)`. The audit doc confirms dispatch flows through `CometArrayInsert` on 
4.0+. Either drop the 4.0+ registration through the shim or mark it as 3.x-only.
   - `CometArrayCompact` (`arrays.scala:311-333`, registered at 
`QueryPlanSerde.scala:53`): `ArrayCompact` is always `RuntimeReplaceable` 
(rewritten to `ArrayFilter(arr, IsNotNull(...))` in all supported Spark 
versions), so the direct `classOf[ArrayCompact] -> CometArrayCompact` mapping 
is never hit at convert time. The actual dispatch path is 
`CometArrayFilter.convert -> CometArrayCompact.convert` 
(`arrays.scala:625-637`). The direct registration should be removed.
   
   ## Medium priority
   
   ### 4. `CometSortArray` Spark 4.0+ foldable-boolean coverage
   
   Spark 4.0 widens `SortArray.ascendingOrder` from a `Literal(_: Boolean, 
BooleanType)` to any foldable boolean. `CometSortArray` 
(`arrays.scala:172-182`) still pattern-matches only `Literal`, so 
`sort_array(arr, NOT some_const)` and similar shapes silently fall back on 
4.0+. Constant-folding catches most cases but not all. Either evaluate the 
foldable expression at convert time (preferred) or surface the limitation in 
`getSupportLevel`.
   
   ### 5. `array_union` result ordering vs DataFusion is unverified
   
   The audit doc for `array_union` notes "Result ordering versus DataFusion is 
also unverified; compare the `array_intersect` ordering caveat." 
`CometArrayUnion` (`arrays.scala:454-466`) is currently `Compatible`. Verify 
whether DataFusion's `array_union` preserves Spark's 
left-first-then-new-right-elements ordering. If it does not, raise the support 
level to `Incompatible` with a documented reason, matching the 
`array_intersect` treatment.
   
   ### 6. `CometArrayExcept` data-type guard should be in `getSupportLevel`
   
   `CometArrayExcept.isTypeSupported` (`arrays.scala:342-355`) explicitly 
rejects `BinaryType` and `StructType` element types, but the rejection happens 
in `convert` (`arrays.scala:361-367`). Move the check into a `getSupportLevel` 
branch so EXPLAIN can report it and the compatibility doc reflects the gap. 
(Same pattern as item 1.)
   
   ### 7. `CometArrayJoin` Spark 4.0+ collation coverage
   
   Audit doc notes Spark 4.0 widens `ArrayJoin.inputTypes` to 
`AbstractArrayType(StringTypeWithCollation(supportsTrimCollation = true))` and 
that non-binary collations are not propagated. Already tracked under #2190 
(umbrella) but worth wiring an explicit `hasNonDefaultStringCollation` guard in 
`CometArrayJoin.getSupportLevel` so the fallback is visible in EXPLAIN, 
matching the pattern `CometArrayIntersect` already follows 
(`arrays.scala:207-213`).
   
   ---
   
   Surfaced by the `audit-comet-expression` skill run in #4483.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to