[PR] [SPARK-57298][SQL] collect_set returns duplicate NaN values for float/double columns [spark]

via GitHub Sat, 06 Jun 2026 20:34:32 -0700


jiwen624 opened a new pull request, #56360:
URL: https://github.com/apache/spark/pull/56360


   ### What changes were proposed in this pull request?
   `CollectSet` now keys its deduplication buffer on the normalized bit pattern 
of floating-point values instead of the boxed Double/Float. In 
`convertToBufferElement`, Double/Float values are first normalized by 
collapsing -0.0 to 0.0 and then stored as `doubleToLongBits`/`floatToIntBits` 
(the `*ToBits` methods canonicalize NaN to a single bit pattern). The buffer 
element type becomes Long/Int, and `eval` converts the bits back to Double/Float.
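
The normalization described above can be sketched in plain Java. This is only an illustration of the bit-pattern keying idea, not the code in this PR: the class `CollectSetBitsSketch` and the helpers `toBufferKey`, `fromBufferKey`, and `collectSet` are hypothetical names, and a `LinkedHashSet<Long>` stands in for the aggregation buffer.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; all names here are hypothetical, not Spark APIs.
final class CollectSetBitsSketch {

    // Map a double to the key stored in the dedup buffer.
    static long toBufferKey(double d) {
        // -0.0 == 0.0 evaluates to true, so both zeros collapse to +0.0.
        double normalized = (d == 0.0d) ? 0.0d : d;
        // doubleToLongBits (unlike doubleToRawLongBits) maps every NaN
        // to one canonical bit pattern.
        return Double.doubleToLongBits(normalized);
    }

    // Convert a stored key back to the double it represents.
    static double fromBufferKey(long bits) {
        return Double.longBitsToDouble(bits);
    }

    // Deduplicate values by bit-pattern key, preserving first-seen order.
    static List<Double> collectSet(double... values) {
        Set<Long> buffer = new LinkedHashSet<>();
        for (double v : values) {
            buffer.add(toBufferKey(v));
        }
        List<Double> result = new ArrayList<>();
        for (long bits : buffer) {
            result.add(fromBufferKey(bits));
        }
        return result;
    }

    public static void main(String[] args) {
        // A NaN with a non-canonical payload, to show that it still dedups.
        double otherNaN = Double.longBitsToDouble(0x7ff0000000000001L);
        System.out.println(collectSet(Double.NaN, otherNaN, 0.0, -0.0, 1.5));
    }
}
```

In this sketch both NaN inputs map to the same key and both zeros map to the key of +0.0, so the five inputs yield three elements. The same approach would apply to Float with `Float.floatToIntBits` and `Float.intBitsToFloat`.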
   
   ### Why are the changes needed?
   `collect_set` returns duplicate NaN elements, which does not follow Spark 
SQL's NaN semantics, under which NaN is treated as equal to NaN:
   ```
   SELECT collect_set(v) FROM VALUES (double('NaN')), (double('NaN')) AS t(v);
   -- actual:   [NaN, NaN]
   -- expected: [NaN] 
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. `collect_set` over FLOAT/DOUBLE columns no longer returns duplicate NaN 
values.
   
   ### How was this patch tested?
   New test case.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   Yes. Claude Code.
   



