andygrove opened a new issue, #4481:
URL: https://github.com/apache/datafusion-comet/issues/4481
## Describe the bug
Spark's `array_distinct`, `array_union`, and `array_except` use
`SQLOpenHashSet.withNaNCheckFunc` to canonicalize NaN equality: every NaN bit
pattern collapses to a single bucket, and `+0.0` / `-0.0` are treated as
distinct. DataFusion's `array_distinct`, `array_union`, and `array_except` use
hash-based set logic where NaN is not equal to NaN (IEEE semantics) and `+0.0
== -0.0`.
So for `Float`/`Double` element arrays containing NaN or signed zero, the
results diverge silently:
- `array_distinct(array(double('NaN'), double('NaN')))` returns one element
in Spark, two in Comet.
- `array_union(array(0.0), array(-0.0))` returns one element in Spark, two
in Comet.
Surfaced by the array-expressions audit (collection PR queue).
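
For reference, the two notions of floating-point equality discussed above can be reproduced in plain Rust. This is only an illustration of IEEE comparison versus bit-pattern comparison; the helper names `ieee_eq` and `bitwise_eq` are made up for this sketch and are not part of Comet, DataFusion, or Spark.

```rust
/// IEEE 754 comparison, which is what `==` on `f64` performs.
fn ieee_eq(a: f64, b: f64) -> bool {
    a == b
}

/// Bit-pattern comparison: equal only if every bit matches.
fn bitwise_eq(a: f64, b: f64) -> bool {
    a.to_bits() == b.to_bits()
}

fn main() {
    let nan = f64::NAN;
    // A second NaN with a different payload (bit pattern chosen for illustration).
    let other_nan = f64::from_bits(0x7ff8_0000_0000_0001);
    assert!(other_nan.is_nan());

    assert!(!ieee_eq(nan, nan)); // IEEE: NaN never equals NaN
    assert!(bitwise_eq(nan, nan)); // identical bits compare equal
    assert!(!bitwise_eq(nan, other_nan)); // different payloads differ bitwise

    assert!(ieee_eq(0.0, -0.0)); // IEEE: signed zeros are equal
    assert!(!bitwise_eq(0.0, -0.0)); // but their sign bits differ
}
```

A set built on either comparison alone will disagree with one that first collapses all NaN values into a single bucket.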
## Steps to reproduce
```sql
SELECT array_distinct(array(CAST('NaN' AS DOUBLE), CAST('NaN' AS DOUBLE)));
-- Spark: [NaN] Comet: [NaN, NaN]
SELECT array_union(array(0.0), array(-0.0));
-- Spark: [0.0] Comet: [0.0, -0.0]
```
## Expected behavior
Either canonicalize NaN/signed-zero on the Comet side before invoking the
DataFusion set operations, or downgrade `array_distinct` / `array_union` /
`array_except` to `Incompatible(Some(...))` whenever the element type is
`FloatType` or `DoubleType`. (`array_intersect` already has a related
`Incompatible` flag for ordering; the NaN gap is separate.)
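
The first option (canonicalizing before the set operation) could look roughly like the sketch below. It works on plain `f64` slices rather than Arrow arrays, and `canonical_key`, `distinct_f64`, and the `fold_zero` switch are hypothetical names introduced here, not existing Comet or DataFusion APIs. Whether signed zeros should be folded together is left as a parameter, to be set to whatever the Spark reference behavior requires.

```rust
use std::collections::HashSet;

/// Map a double to the key used for set membership.
/// Every NaN bit pattern collapses to the bits of `f64::NAN`.
/// `fold_zero` (hypothetical switch): when true, -0.0 is folded into +0.0.
fn canonical_key(v: f64, fold_zero: bool) -> u64 {
    if v.is_nan() {
        f64::NAN.to_bits()
    } else if fold_zero && v == 0.0 {
        0.0_f64.to_bits()
    } else {
        v.to_bits()
    }
}

/// Order-preserving distinct over doubles using the canonical key.
/// The first occurrence of each key is kept unchanged.
fn distinct_f64(values: &[f64], fold_zero: bool) -> Vec<f64> {
    let mut seen = HashSet::new();
    values
        .iter()
        .copied()
        .filter(|v| seen.insert(canonical_key(*v, fold_zero)))
        .collect()
}

fn main() {
    let other_nan = f64::from_bits(0x7ff8_0000_0000_0001);
    let out = distinct_f64(&[f64::NAN, other_nan, 1.0], false);
    assert_eq!(out.len(), 2); // NaN kept once, then 1.0
    assert!(out[0].is_nan());

    assert_eq!(distinct_f64(&[0.0, -0.0], true).len(), 1);
    assert_eq!(distinct_f64(&[0.0, -0.0], false).len(), 2);
}
```

The same key function could in principle back union and except as well, since all three operations only need a consistent notion of element identity.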
## Additional context
- Comet serdes: `CometScalarFunction("array_distinct")` (registered in
`QueryPlanSerde.scala`), `CometArrayUnion` and `CometArrayExcept` in
`spark/src/main/scala/org/apache/comet/serde/arrays.scala`.
- Spark reference: `SQLOpenHashSet.withNaNCheckFunc` and `ArraySetLike` in
`collectionOperations.scala`.
- Related: `array_intersect` carries an ordering `Incompatible` flag but
does not document the NaN gap either.