andygrove opened a new issue, #4505: URL: https://github.com/apache/datafusion-comet/issues/4505
Tracking issue for follow-up work surfaced by the map expression audit in #4478. Each item below is either a correctness divergence the audit did not gate at the support-level, or a serde-consistency finding the audit did not apply. ## High priority ### 1. `CometStrToMap` does not gate Spark 4.0+ non-default string collations Spark 4.0 widens `StringToMap`'s `inputTypes` to `StringTypeNonCSAICollation` and routes both split calls through `CollationAwareUTF8String.splitSQL(..., collationId)` (`apache-spark-4.0.1` `complexTypeCreator.scala:565-622`). The Comet path registers the upstream `datafusion-spark` `SparkStrToMap` UDF (`native/core/src/execution/jni_api.rs:60,601`) which splits on the raw delimiter bytes. For any collation other than `UTF8_BINARY` (e.g. case-insensitive or accent-insensitive collations), the delimiter match in Spark and Comet would diverge silently. Per skill rule 11 (Spark 4.0+ collation gaps), `CometStrToMap` (`spark/src/main/scala/org/apache/comet/serde/maps.scala:163`) should return `Incompatible(Some(...))` for non-`UTF8_BINARY` collation on any of the three string inputs, and the reason should be enumerated in `getIncompatibleReasons()`. ### 2. `CometMapFromArrays` does not enforce Spark's null-key rejection or `spark.sql.mapKeyDedupPolicy` Spark's `MapFromArrays` (`apache-spark-3.5.8` `complexTypeCreator.scala:300`) routes through `ArrayBasedMapBuilder`, which throws `RuntimeException(\"Cannot use null as map key\")` when the keys array contains a NULL element, and applies `spark.sql.mapKeyDedupPolicy` (`EXCEPTION` vs `LAST_WIN`) for duplicate keys. `CometMapFromArrays` (`spark/src/main/scala/org/apache/comet/serde/maps.scala:76-117`) only wraps the call in a CASE WHEN that handles whole-array NULLs; it does not detect a null element inside the keys array, and it does not implement either dedup policy. For datasets that contain a NULL key or duplicate keys, Comet will silently produce a map where Spark would throw, or vice versa. `CometMapFromArrays` should declare an `Incompatible(Some(...))` branch covering null-key rejection and dedup-policy semantics, and enumerate the reasons in `getIncompatibleReasons()`. ### 3. `CometMapFromEntries` shares the null-key / dedup-policy divergence Spark's `MapFromEntries` (`apache-spark-3.5.8` `collectionOperations.scala:812`, `mapBuilder.put` at line 865) reuses `ArrayBasedMapBuilder`, so a struct array containing a NULL key or duplicate keys triggers the same null-key exception and dedup-policy enforcement as `MapFromArrays`. `CometMapFromEntries` (`spark/src/main/scala/org/apache/comet/serde/maps.scala:136`) only gates on `BinaryType` keys / values; the null-key and duplicate-key cases are unmarked. `getSupportLevel` should be extended with an `Incompatible(Some(...))` branch for these cases (or a tighter input check), with matching entries in `getIncompatibleReasons()`. ### 4. Lift `CometElementAt`'s MapType `convert`-time fallback into `getSupportLevel` `CometElementAt` (`spark/src/main/scala/org/apache/comet/serde/arrays.scala:560-599`) detects MapType input inside `convert` and calls `withInfo(expr, \"Input is not an array\") + return None`. Per skill rule 6 (\"Gate compatibility decisions in `getSupportLevel`, not inside `convert`\") and rule 10 (expression-shape restrictions belong in `getSupportLevel`), this fully-knowable case should be an `Unsupported(Some(reason))` branch keyed on `expr.left.dataType.isInstanceOf[MapType]`. `getUnsupportedReasons()` already returns the correct reason (\"Input must be an array. `Map` inputs are not supported.\"), so the EXPLAIN path would surface a cleaner message and the dispatcher would route the case uniformly. ## Medium priority ### 5. `CometMapContainsKey` is an unreachable serde `MapContainsKey` extends `RuntimeReplaceable with InheritAnalysisRules` in every audited Spark version (`apache-spark-3.5.8` `collectionOperations.scala:223-229`, same shape in 4.0.1 / 4.1.1), and its `replacement` is `ArrayContains(MapKeys(left), right)`. The optimizer rewrites `MapContainsKey` away before the physical plan reaches `QueryPlanSerde`, so the serde registered at `spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:152` and defined at `spark/src/main/scala/org/apache/comet/serde/maps.scala:119-134` is never invoked. Per skill rule 13 (unreachable serde mappings), either delete `CometMapContainsKey` plus its registration, or leave a comment explaining the path it covers. --- Surfaced by the `audit-comet-expression` skill run in #4478. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
