[PR] [SPARK-45599][CORE] Use object equality in OpenHashSet [spark]

via GitHub Mon, 05 Feb 2024 18:56:46 -0800


nchammas opened a new pull request, #45036:
URL: https://github.com/apache/spark/pull/45036

### What changes were proposed in this pull request?

Change `OpenHashSet` to use object equality (`equals`) instead of cooperative
equality (`==`) when looking up keys.

### Why are the changes needed?

When a) both 0.0 and -0.0 are provided as keys to the set and b) they happen
to hash to the same bucket, one of the values is dropped: under cooperative
equality 0.0 and -0.0 compare as equal, so the lookup reports that the value is
already in the set. This leads to the bug described in SPARK-45599 and
summarized in [this comment][1].

[1]:
https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17806954
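
To make the difference concrete, here is a minimal sketch of how the two notions of equality treat signed zeros and `NaN` in Scala. The names `EqualityDemo`, `cooperativeEq`, and `objectEq` are hypothetical and exist only for illustration; they are not part of `OpenHashSet`.

```scala
object EqualityDemo {
  // Cooperative equality: `==` on boxed numerics compares by numeric value.
  def cooperativeEq(a: Any, b: Any): Boolean = a == b

  // Object equality: delegates to java.lang.Double.equals, which compares
  // the values' bit representations (with NaN canonicalized).
  def objectEq(a: Any, b: Any): Boolean = a.equals(b)

  def main(args: Array[String]): Unit = {
    println(cooperativeEq(0.0, -0.0))              // true
    println(objectEq(0.0, -0.0))                   // false
    println(cooperativeEq(Double.NaN, Double.NaN)) // false
    println(objectEq(Double.NaN, Double.NaN))      // true
  }
}
```

Under `==`, a lookup for -0.0 can match a stored 0.0 (or the reverse), which is how one of the two keys can be lost.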

### Does this PR introduce _any_ user-facing change?

Yes, it resolves the bug described in SPARK-45599, which affects
`percentile()` and perhaps other functions that rely on `OpenHashSet` under the
hood.

Shifting from `==` to `equals` also changes how `NaN` is stored in the set.
`NaN == NaN` is false, so each `NaN` previously got its own entry; under
`equals` all `NaN` values match, so `NaN` will now only get one entry. The user
impact of that should only be to save some memory.
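
The effect on set contents can be sketched with a toy linear-probing set. This is an illustrative assumption, not Spark's `OpenHashSet`: `ToySet`, `ToySetDemo`, and `sizeAfter` are hypothetical names, and every key is forced to start probing at the same slot to mimic a bucket collision.

```scala
// Hypothetical toy set for illustration only; this is NOT Spark's OpenHashSet.
// Every key starts probing at slot 0, which mimics the worst case where all
// keys collide in the same bucket.
final class ToySet(capacity: Int, same: (Double, Double) => Boolean) {
  private val slots = new Array[Double](capacity)
  private var count = 0

  def size: Int = count

  def add(key: Double): Unit = {
    // Linear probe over the occupied slots; stop if an "equal" key is found.
    var pos = 0
    while (pos < count) {
      if (same(slots(pos), key)) return
      pos += 1
    }
    require(count < capacity, "toy set is full")
    slots(count) = key
    count += 1
  }
}

object ToySetDemo {
  // Cooperative (numeric) equality on primitives.
  val cooperative: (Double, Double) => Boolean = (a, b) => a == b

  // Object equality via java.lang.Double.equals.
  val objectEq: (Double, Double) => Boolean =
    (a, b) => java.lang.Double.valueOf(a).equals(java.lang.Double.valueOf(b))

  // Insert all keys into a fresh toy set and report how many were kept.
  def sizeAfter(same: (Double, Double) => Boolean, keys: Seq[Double]): Int = {
    val set = new ToySet(16, same)
    keys.foreach(k => set.add(k))
    set.size
  }

  def main(args: Array[String]): Unit = {
    val zeros = Seq(0.0, -0.0)
    val nans = Seq(Double.NaN, Double.NaN, Double.NaN)
    println(sizeAfter(cooperative, zeros)) // 1: -0.0 is dropped
    println(sizeAfter(objectEq, zeros))    // 2: both zeros are kept
    println(sizeAfter(cooperative, nans))  // 3: each NaN gets its own entry
    println(sizeAfter(objectEq, nans))     // 1: NaN is stored once
  }
}
```

In this sketch, switching the comparison function is the only change between the two behaviors, which mirrors the intent of the PR without reproducing its actual implementation.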

### How was this patch tested?

New and existing unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

