jiwen624 opened a new pull request, #56360:
URL: https://github.com/apache/spark/pull/56360
### What changes were proposed in this pull request?
`CollectSet` now keys its deduplication buffer on the normalized bit pattern
of floating-point values instead of the boxed Double/Float. In
`convertToBufferElement`, Double/Float values are first normalized so that
-0.0 becomes 0.0, then stored as `doubleToLongBits`/`floatToIntBits`, which
canonicalize every NaN to a single bit pattern. The buffer element type
becomes Long/Int, and `eval` converts the bits back to Double/Float.
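
The normalization described above can be illustrated with a minimal standalone sketch. This is not the patch itself: the class `NormalizedBitsSketch` and the helpers `toKey`, `fromKey`, and `collectSet` are hypothetical names chosen for illustration, and a plain `LinkedHashSet<Long>` stands in for the aggregation buffer. Only `Double.doubleToLongBits` and `Double.longBitsToDouble` are real JDK calls.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class NormalizedBitsSketch {

    // Hypothetical helper: map a double to the long used as the set key.
    static long toKey(double v) {
        // -0.0 == 0.0 evaluates to true, so both zeros collapse to +0.0.
        double normalized = (v == 0.0d) ? 0.0d : v;
        // doubleToLongBits (unlike doubleToRawLongBits) maps every NaN
        // to the single canonical pattern 0x7ff8000000000000L.
        return Double.doubleToLongBits(normalized);
    }

    // Hypothetical helper: convert a stored key back to a double.
    static double fromKey(long bits) {
        return Double.longBitsToDouble(bits);
    }

    // Stand-in for the aggregate: deduplicate on keys, then decode.
    static List<Double> collectSet(double[] values) {
        Set<Long> buffer = new LinkedHashSet<>();
        for (double v : values) {
            buffer.add(toKey(v));
        }
        List<Double> result = new ArrayList<>();
        for (long bits : buffer) {
            result.add(fromKey(bits));
        }
        return result;
    }

    public static void main(String[] args) {
        // A NaN with a non-canonical payload, to show that it still dedupes.
        double payloadNaN = Double.longBitsToDouble(0x7ff8000000000001L);
        double[] input = {Double.NaN, payloadNaN, 0.0d, -0.0d, 1.5d};
        System.out.println(collectSet(input));
    }
}
```

Keying on the primitive bit pattern makes the set's equality independent of how the surrounding collection compares boxed floating-point values, which is the property the change relies on.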
### Why are the changes needed?
`collect_set` returns duplicate NaN elements, which does not follow Spark's
NaN semantics (NaN values should be treated as equal to each other):
```
SELECT collect_set(v) FROM VALUES (double('NaN')), (double('NaN')) AS t(v);
-- actual: [NaN, NaN]
-- expected: [NaN]
```
### Does this PR introduce _any_ user-facing change?
Yes. `collect_set` over FLOAT/DOUBLE columns no longer returns duplicate NaN
values.
### How was this patch tested?
New test case.
### Was this patch authored or co-authored using generative AI tooling?
Yes. Claude Code.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]