andygrove opened a new issue, #4481:
URL: https://github.com/apache/datafusion-comet/issues/4481

   ## Describe the bug
   
   Spark's `array_distinct`, `array_union`, and `array_except` use 
`SQLOpenHashSet.withNaNCheckFunc` to canonicalize NaN equality: every NaN bit 
pattern collapses to a single bucket, and `+0.0` / `-0.0` are treated as 
distinct. DataFusion's `array_distinct`, `array_union`, and `array_except` use 
hash-based set logic where NaN is not equal to NaN (IEEE semantics) and 
`+0.0 == -0.0`.
   
   So for arrays with `Float`/`Double` elements that contain NaN or signed 
zeros, the results diverge silently:
   
   - `array_distinct(array(double('NaN'), double('NaN')))` returns one element 
in Spark, two in Comet.
   - `array_union(array(0.0), array(-0.0))` returns one element in Spark, two 
in Comet.
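   
   To make the two notions of equality concrete, here is a minimal Rust sketch. It is an illustration only, not Spark's or DataFusion's actual code: `ieee_eq`, `canonical_key`, and `CANONICAL_NAN_BITS` are hypothetical names, and the key function simply mirrors the description above (all NaN bit patterns share one bucket, signed zeros keep their own bits).
   
   ```rust
   /// Assumed canonical bit pattern that stands in for every NaN.
   const CANONICAL_NAN_BITS: u64 = 0x7ff8_0000_0000_0000;

   /// IEEE 754 comparison: NaN never equals NaN, and +0.0 equals -0.0.
   fn ieee_eq(a: f64, b: f64) -> bool {
       a == b
   }

   /// Hypothetical set key: every NaN maps to one pattern, while all other
   /// values keep their raw bits, so +0.0 and -0.0 remain distinct keys.
   fn canonical_key(x: f64) -> u64 {
       if x.is_nan() {
           CANONICAL_NAN_BITS
       } else {
           x.to_bits()
       }
   }
   ```
   
   Under these definitions, `ieee_eq(f64::NAN, f64::NAN)` is false while `canonical_key` returns the same key for any two NaN inputs. The reverse holds for signed zeros: `ieee_eq(0.0, -0.0)` is true, but the two keys differ.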
   
   Surfaced by the array-expressions audit (collection PR queue).
   
   ## Steps to reproduce
   
   ```sql
   SELECT array_distinct(array(CAST('NaN' AS DOUBLE), CAST('NaN' AS DOUBLE)));
   -- Spark: [NaN]      Comet: [NaN, NaN]
   
   SELECT array_union(array(0.0), array(-0.0));
   -- Spark: [0.0]      Comet: [0.0, -0.0]
   ```
   
   ## Expected behavior
   
   Either canonicalize NaN/signed-zero on the Comet side before invoking the 
DataFusion set operations, or downgrade `array_distinct` / `array_union` / 
`array_except` to `Incompatible(Some(...))` whenever the element type is 
`FloatType` or `DoubleType`. (`array_intersect` already has a related 
`Incompatible` flag for ordering; the NaN gap is separate.)
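   
   As a rough sketch of the first option, a canonicalization pre-pass could look something like the Rust below. This is not Comet code: `canonicalize` and `distinct_canonical` are hypothetical helpers over a plain `f64` slice rather than Arrow arrays, and `fold_signed_zero` is an assumed flag, since whether `-0.0` should be folded depends on the Spark behavior being matched.
   
   ```rust
   use std::collections::HashSet;

   /// Hypothetical pre-pass: rewrite every NaN to one canonical NaN and,
   /// if requested, fold -0.0 into +0.0.
   fn canonicalize(x: f64, fold_signed_zero: bool) -> f64 {
       if x.is_nan() {
           f64::NAN
       } else if fold_signed_zero && x == 0.0 {
           0.0
       } else {
           x
       }
   }

   /// Order-preserving distinct over canonicalized values, keyed on raw bits
   /// so that the canonical NaN compares equal to itself.
   fn distinct_canonical(values: &[f64], fold_signed_zero: bool) -> Vec<f64> {
       let mut seen = HashSet::new();
       let mut out = Vec::new();
       for &v in values {
           let c = canonicalize(v, fold_signed_zero);
           if seen.insert(c.to_bits()) {
               out.push(c);
           }
       }
       out
   }
   ```
   
   With this sketch, `distinct_canonical(&[f64::NAN, -f64::NAN, 1.0], false)` yields two elements (one NaN, then `1.0`), and `distinct_canonical(&[0.0, -0.0], true)` yields one.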
   
   ## Additional context
   
   - Comet serdes: `CometScalarFunction("array_distinct")` (registered in 
`QueryPlanSerde.scala`), `CometArrayUnion` and `CometArrayExcept` in 
`spark/src/main/scala/org/apache/comet/serde/arrays.scala`.
   - Spark reference: `SQLOpenHashSet.withNaNCheckFunc` and `ArraySetLike` in 
`collectionOperations.scala`.
   - Related: `array_intersect` carries an ordering `Incompatible` flag but 
does not document the NaN gap either.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

