nchammas commented on PR #53695: URL: https://github.com/apache/spark/pull/53695#issuecomment-4510971006
Thanks for taking a look at the linked issue. > As @cloud-fan noted in the discussion of #45036, what matters for these operators is the SQL semantic rather than alignment with java.util.HashSet. In Spark SQL, 0.0 and -0.0 are already considered equal in grouping semantics and normalized to 0.0. This PR makes the hash-based array operations consistent with those existing SQL semantics. A subtle but important point: I used the behavior of `java.util.HashSet` to support the changes I was making, but alignment with `java.util.HashSet` was _not_ the ultimate goal. The goal was to fix a plain correctness bug in the behavior of the percentile aggregate function. You can see a [minimal reproduction of that bug here][bug]. [bug]: https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17806954 > Preserving -0.0 here would keep these array set-like operations inconsistent with existing Spark SQL semantics, since equivalent scalar operations (DISTINCT, GROUP BY) already normalize it to 0.0. For reference, I checked the behavior of PostgreSQL: ```sql postgres=# select 0.0 union select -0.0; ?column? ---------- 0.0 (1 row) postgres=# select distinct unnest(array[0.0, -0.0]); unnest -------- 0.0 (1 row) ``` PostgreSQL does not have array functions like `array_{intersect, union, except, distinct}`. Instead, you `unnest()` the array and then use the usual SELECT-level operations. This produces the behavior you are advocating for. PostgreSQL seems to go further in that it immediately normalizes -0.0, even if there is no aggregation involved: ```sql postgres=# select -0.0; ?column? ---------- 0.0 (1 row) ``` If we're going to normalize, I think it would be simpler both from the implementation and user perspectives to just always normalize -0.0 to 0.0 like PostgreSQL. #23388 implies that it is important that we preserve the difference between 0.0 and -0.0, but it's not clear why beyond that it's the status quo. Maybe that's good enough a justification. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
