andygrove opened a new issue, #4505:
URL: https://github.com/apache/datafusion-comet/issues/4505

   Tracking issue for follow-up work surfaced by the map expression audit in 
#4478. Each item below is either a correctness divergence the audit did not 
gate at the support-level, or a serde-consistency finding the audit did not 
apply.
   
   ## High priority
   
   ### 1. `CometStrToMap` does not gate Spark 4.0+ non-default string collations
   
   Spark 4.0 widens `StringToMap`'s `inputTypes` to 
`StringTypeNonCSAICollation` and routes both split calls through 
`CollationAwareUTF8String.splitSQL(..., collationId)` (`apache-spark-4.0.1` 
`complexTypeCreator.scala:565-622`). The Comet path registers the upstream 
`datafusion-spark` `SparkStrToMap` UDF 
(`native/core/src/execution/jni_api.rs:60,601`) which splits on the raw 
delimiter bytes. For any collation other than `UTF8_BINARY` (e.g. 
case-insensitive or accent-insensitive collations), the delimiter match in 
Spark and Comet would diverge silently.
   
   Per skill rule 11 (Spark 4.0+ collation gaps), `CometStrToMap` 
(`spark/src/main/scala/org/apache/comet/serde/maps.scala:163`) should return 
`Incompatible(Some(...))` for non-`UTF8_BINARY` collation on any of the three 
string inputs, and the reason should be enumerated in 
`getIncompatibleReasons()`.
   
   ### 2. `CometMapFromArrays` does not enforce Spark's null-key rejection or 
`spark.sql.mapKeyDedupPolicy`
   
   Spark's `MapFromArrays` (`apache-spark-3.5.8` 
`complexTypeCreator.scala:300`) routes through `ArrayBasedMapBuilder`, which 
throws `RuntimeException(\"Cannot use null as map key\")` when the keys array 
contains a NULL element, and applies `spark.sql.mapKeyDedupPolicy` (`EXCEPTION` 
vs `LAST_WIN`) for duplicate keys. `CometMapFromArrays` 
(`spark/src/main/scala/org/apache/comet/serde/maps.scala:76-117`) only wraps 
the call in a CASE WHEN that handles whole-array NULLs; it does not detect a 
null element inside the keys array, and it does not implement either dedup 
policy. For datasets that contain a NULL key or duplicate keys, Comet will 
silently produce a map where Spark would throw, or vice versa.
   
   `CometMapFromArrays` should declare an `Incompatible(Some(...))` branch 
covering null-key rejection and dedup-policy semantics, and enumerate the 
reasons in `getIncompatibleReasons()`.
   
   ### 3. `CometMapFromEntries` shares the null-key / dedup-policy divergence
   
   Spark's `MapFromEntries` (`apache-spark-3.5.8` 
`collectionOperations.scala:812`, `mapBuilder.put` at line 865) reuses 
`ArrayBasedMapBuilder`, so a struct array containing a NULL key or duplicate 
keys triggers the same null-key exception and dedup-policy enforcement as 
`MapFromArrays`. `CometMapFromEntries` 
(`spark/src/main/scala/org/apache/comet/serde/maps.scala:136`) only gates on 
`BinaryType` keys / values; the null-key and duplicate-key cases are unmarked.
   
   `getSupportLevel` should be extended with an `Incompatible(Some(...))` 
branch for these cases (or a tighter input check), with matching entries in 
`getIncompatibleReasons()`.
   
   ### 4. Lift `CometElementAt`'s MapType `convert`-time fallback into 
`getSupportLevel`
   
   `CometElementAt` 
(`spark/src/main/scala/org/apache/comet/serde/arrays.scala:560-599`) detects 
MapType input inside `convert` and calls `withInfo(expr, \"Input is not an 
array\") + return None`. Per skill rule 6 (\"Gate compatibility decisions in 
`getSupportLevel`, not inside `convert`\") and rule 10 (expression-shape 
restrictions belong in `getSupportLevel`), this fully-knowable case should be 
an `Unsupported(Some(reason))` branch keyed on 
`expr.left.dataType.isInstanceOf[MapType]`. `getUnsupportedReasons()` already 
returns the correct reason (\"Input must be an array. `Map` inputs are not 
supported.\"), so the EXPLAIN path would surface a cleaner message and the 
dispatcher would route the case uniformly.
   
   ## Medium priority
   
   ### 5. `CometMapContainsKey` is an unreachable serde
   
   `MapContainsKey` extends `RuntimeReplaceable with InheritAnalysisRules` in 
every audited Spark version (`apache-spark-3.5.8` 
`collectionOperations.scala:223-229`, same shape in 4.0.1 / 4.1.1), and its 
`replacement` is `ArrayContains(MapKeys(left), right)`. The optimizer rewrites 
`MapContainsKey` away before the physical plan reaches `QueryPlanSerde`, so the 
serde registered at 
`spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:152` and 
defined at `spark/src/main/scala/org/apache/comet/serde/maps.scala:119-134` is 
never invoked. Per skill rule 13 (unreachable serde mappings), either delete 
`CometMapContainsKey` plus its registration, or leave a comment explaining the 
path it covers.
   
   ---
   
   Surfaced by the `audit-comet-expression` skill run in #4478.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to