[I] collection expression audit follow-ups (from #4473) [datafusion-comet]

via GitHub Thu, 28 May 2026 15:58:18 -0700


andygrove opened a new issue, #4504:
URL: https://github.com/apache/datafusion-comet/issues/4504


   Tracking issue for follow-up work surfaced by the collection expression 
audit in #4473. Each item below is either a support-level / serde correctness 
fix that the audit deliberately deferred, or a behavioural gap the audit 
documented but did not implement.
   
   ## High priority
   
   ### 1. `CometConcat` does not mark non-default Spark 4.0+ collations as 
`Incompatible`
   
   Spark 4.0 widens `Concat.allowedTypes` from `StringType` to 
`StringTypeWithCollation(supportsTrimCollation = true)` and preserves the 
collation in the merged result type. `CometConcat.getSupportLevel` 
(`spark/src/main/scala/org/apache/comet/serde/strings.scala:232-238`) returns 
`Compatible()` whenever every child is `StringType`, regardless of collation, 
and the native `concat` UDF always produces `Utf8` (`UTF8_BINARY` semantics). 
Per the `audit-comet-expression` skill (rule 11), Spark 4.0+ collation gaps 
must flip the support level to `Incompatible(Some(reason))` so EXPLAIN and the 
auto-generated compatibility doc surface the divergence. Cross-reference #2190 
/ #4496.
   
   ### 2. `CometReverse` does not mark non-default Spark 4.0+ collations as 
`Incompatible`
   
   Spark 4.0 widens `Reverse.inputTypes` from `TypeCollection(StringType, 
ArrayType)` to `TypeCollection(StringTypeWithCollation(supportsTrimCollation = 
true), ArrayType)` and propagates the collation through `dataType`. 
`CometReverse.getSupportLevel` 
(`spark/src/main/scala/org/apache/comet/serde/collectionOperations.scala:32-38`)
 returns `Compatible()` for the non-array (string) branch and the native 
`reverse` UDF reverses code units, producing `Utf8`. The string branch should 
report `Incompatible(Some(reason))` for non-`UTF8_BINARY` collations, mirroring 
the concat fix. Cross-reference #2190 / #4496.
   
   ### 3. `CometReverse.getIncompatibleReasons` delegates only to the array 
branch
   
   `CometReverse.getIncompatibleReasons()` 
(`spark/src/main/scala/org/apache/comet/serde/collectionOperations.scala:29-30`)
 returns `CometArrayReverse.getIncompatibleReasons()`. Once the collation gap 
from item 2 is reflected in `getSupportLevel`, the string-collation reason also 
needs to appear in `getIncompatibleReasons()` so it reaches the compatibility 
doc. Per skill rule 2, every distinct `Incompatible(Some(r))` branch must be 
represented in the reasons method.
   
   ## Medium priority
   
   ### 4. `array/size.sql` MapType column-reference path uses 
`spark_answer_only`
   
   `spark/src/test/resources/sql-tests/expressions/array/size.sql:24-25` runs 
the MapType column-reference case as `query spark_answer_only`, which accepts 
whatever Spark returns and does not lock in that Comet falls back. The intended 
behaviour after #4472 is that `size(map_col)` falls back to Spark, so the case 
should be `query expect_fallback(...)` with the same reason string that 
`CometSize.getSupportLevel` returns for `MapType`.
   
   ### 5. `CometSize` defensive non-array / non-map branch
   
   `CometSize.getSupportLevel` 
(`spark/src/main/scala/org/apache/comet/serde/arrays.scala:649-652`) ends with 
an `Unsupported(Some(s"Unsupported child data type: $other"))` branch tagged 
"this should be unreachable because Spark only supports map and array inputs". 
Either drop the branch and let the match fail (turning a planner bug into a 
clear exception), or list the reason in `getUnsupportedReasons()` so the audit 
consistency check (skill rule 2) passes. Today the reason can be returned by 
`getSupportLevel` but is not enumerated in `getUnsupportedReasons()`, which 
only lists the `MapType` reason.
   
   ---
   
   Surfaced by the `audit-comet-expression` skill run in #4473.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] collection expression audit follow-ups (from #4473) [datafusion-comet]

Reply via email to