nchammas commented on code in PR #45036:
URL: https://github.com/apache/spark/pull/45036#discussion_r1486895251
##########
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala:
##########
@@ -269,4 +269,35 @@ class OpenHashSetSuite extends SparkFunSuite with Matchers
{
assert(pos1 == pos2)
}
}
+
+ test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {
Review Comment:
> Spark's `OpenHashSet` does not have to match `java.util.HashSet`. What
matters is the SQL semantic.
Whether or not `OpenHashSet` matches `java.util.HashSet`, I want to
emphasize for the record that `OpenHashSet` mishandles 0.0/-0.0 and NaN. Its
behavior is simply _incorrect_. [These][test1] [tests][test2] fail on `master`
in ways that can only be described as bugs, regardless of whatever SQL
semantics we want to preserve.
[This comment][why] explains the root cause. Basically, it is a mistake to
combine hash code-based lookups with cooperative equality, at least in the way
we are doing it in `OpenHashSet`.
[test1]:
https://github.com/apache/spark/pull/45036/files#diff-09400ec633b1f1322c5f7b39dc4e87073b0b0435b60b9cff93388053be5083b6R274-R278
[test2]:
https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R308-R309
[why]:
https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R277-R283
But I understand what you are saying. Fixing bugs in `OpenHashSet` doesn't
help us if it also breaks users' SQL.
> Can you highlight which functions/operators are using this `OpenHashSet`
I've updated the PR description with a summary of what uses `OpenHashSet`.
As a side note, I believe that if we accept the change proposed here, we
should be able to eliminate `SQLOpenHashSet`. `SQLOpenHashSet` was created
specifically to work around the bugs in `OpenHashSet` that we are addressing in
this PR. See #33955 and #33993.
> and what is the impact of this change to the SQL semantic?
I've updated the PR description with a diff of what tests pass or fail on
`master` vs. this branch. Please take a look and let me know if you think we
need any more tests. I know we are touching a sensitive code path and I
appreciate the need for caution.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]