andygrove opened a new issue, #4504: URL: https://github.com/apache/datafusion-comet/issues/4504
Tracking issue for follow-up work surfaced by the collection expression audit in #4473. Each item below is either a support-level / serde correctness fix that the audit deliberately deferred, or a behavioural gap the audit documented but did not implement. ## High priority ### 1. `CometConcat` does not mark non-default Spark 4.0+ collations as `Incompatible` Spark 4.0 widens `Concat.allowedTypes` from `StringType` to `StringTypeWithCollation(supportsTrimCollation = true)` and preserves the collation in the merged result type. `CometConcat.getSupportLevel` (`spark/src/main/scala/org/apache/comet/serde/strings.scala:232-238`) returns `Compatible()` whenever every child is `StringType`, regardless of collation, and the native `concat` UDF always produces `Utf8` (`UTF8_BINARY` semantics). Per the `audit-comet-expression` skill (rule 11), Spark 4.0+ collation gaps must flip the support level to `Incompatible(Some(reason))` so EXPLAIN and the auto-generated compatibility doc surface the divergence. Cross-reference #2190 / #4496. ### 2. `CometReverse` does not mark non-default Spark 4.0+ collations as `Incompatible` Spark 4.0 widens `Reverse.inputTypes` from `TypeCollection(StringType, ArrayType)` to `TypeCollection(StringTypeWithCollation(supportsTrimCollation = true), ArrayType)` and propagates the collation through `dataType`. `CometReverse.getSupportLevel` (`spark/src/main/scala/org/apache/comet/serde/collectionOperations.scala:32-38`) returns `Compatible()` for the non-array (string) branch and the native `reverse` UDF reverses code units, producing `Utf8`. The string branch should report `Incompatible(Some(reason))` for non-`UTF8_BINARY` collations, mirroring the concat fix. Cross-reference #2190 / #4496. ### 3. `CometReverse.getIncompatibleReasons` delegates only to the array branch `CometReverse.getIncompatibleReasons()` (`spark/src/main/scala/org/apache/comet/serde/collectionOperations.scala:29-30`) returns `CometArrayReverse.getIncompatibleReasons()`. Once the collation gap from item 2 is reflected in `getSupportLevel`, the string-collation reason also needs to appear in `getIncompatibleReasons()` so it reaches the compatibility doc. Per skill rule 2, every distinct `Incompatible(Some(r))` branch must be represented in the reasons method. ## Medium priority ### 4. `array/size.sql` MapType column-reference path uses `spark_answer_only` `spark/src/test/resources/sql-tests/expressions/array/size.sql:24-25` runs the MapType column-reference case as `query spark_answer_only`, which accepts whatever Spark returns and does not lock in that Comet falls back. The intended behaviour after #4472 is that `size(map_col)` falls back to Spark, so the case should be `query expect_fallback(...)` with the same reason string that `CometSize.getSupportLevel` returns for `MapType`. ### 5. `CometSize` defensive non-array / non-map branch `CometSize.getSupportLevel` (`spark/src/main/scala/org/apache/comet/serde/arrays.scala:649-652`) ends with an `Unsupported(Some(s"Unsupported child data type: $other"))` branch tagged "this should be unreachable because Spark only supports map and array inputs". Either drop the branch and let the match fail (turning a planner bug into a clear exception), or list the reason in `getUnsupportedReasons()` so the audit consistency check (skill rule 2) passes. Today the reason can be returned by `getSupportLevel` but is not enumerated in `getUnsupportedReasons()`, which only lists the `MapType` reason. --- Surfaced by the `audit-comet-expression` skill run in #4473. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
