nchammas opened a new pull request, #45036: URL: https://github.com/apache/spark/pull/45036
### What changes were proposed in this pull request?

Change `OpenHashSet` to use object equality instead of cooperative equality when looking up keys.

### Why are the changes needed?

In certain cases where a) both 0.0 and -0.0 are provided as keys to the set and b) they happen to hash to the same bucket, one of the values will be dropped because the lookup indicates the value is already in the set. This leads to the bug described in SPARK-45599 and summarized in [this comment][1].

[1]: https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17806954

### Does this PR introduce _any_ user-facing change?

Yes, it resolves the bug described in SPARK-45599, which affects `percentile()` and perhaps other functions that rely on `OpenHashSet` under the hood.

Shifting from `==` to `equals` also changes how `NaN` is stored in the set, though the user impact of that should only be to save some memory since `NaN` will now only get one entry in the set.

### How was this patch tested?

New and existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.
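For reviewers who want to see the two equality semantics side by side, here is a minimal sketch. It is not the `OpenHashSet` code: the object `SignedZeroEquality` and the helpers `primitiveEq`, `objectEq`, and `dedupe` are hypothetical names made up for illustration, and `dedupe` is a toy linear scan that ignores hashing and buckets entirely. It only shows why `==` on doubles merges 0.0 with -0.0 and keeps every `NaN`, while `equals` on boxed doubles does the opposite.

```scala
object SignedZeroEquality {
  // Primitive `==` follows IEEE 754: 0.0 and -0.0 compare equal,
  // and NaN is not equal to anything, including itself.
  def primitiveEq(a: Double, b: Double): Boolean = a == b

  // java.lang.Double.equals compares the bit patterns from doubleToLongBits:
  // 0.0 and -0.0 differ, and NaN equals NaN.
  def objectEq(a: Double, b: Double): Boolean =
    java.lang.Double.valueOf(a).equals(java.lang.Double.valueOf(b))

  // Toy stand-in for a set lookup (hypothetical, not Spark's implementation):
  // a key is kept only if no previously kept key is "the same" under `same`.
  def dedupe(keys: Seq[Double], same: (Double, Double) => Boolean): Seq[Double] =
    keys.foldLeft(Vector.empty[Double]) { (kept, k) =>
      if (kept.exists(same(_, k))) kept else kept :+ k
    }

  def main(args: Array[String]): Unit = {
    val zeros = Seq(0.0, -0.0)
    val nans = Seq(Double.NaN, Double.NaN)

    // With `==`, one of the zeros is dropped and both NaNs are kept.
    println(dedupe(zeros, primitiveEq).size)
    println(dedupe(nans, primitiveEq).size)

    // With `equals`, both zeros are kept and NaN is stored once.
    println(dedupe(zeros, objectEq).size)
    println(dedupe(nans, objectEq).size)
  }
}
```

The real bug additionally requires 0.0 and -0.0 to land in the same bucket, which this sketch does not model; it only isolates the comparison step.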
