asugranyes commented on PR #53695: URL: https://github.com/apache/spark/pull/53695#issuecomment-4511278234
> Thanks for taking a look at the linked issue. > > > > > As @cloud-fan noted in the discussion of #45036, what matters for these operators is the SQL semantic rather than alignment with java.util.HashSet. In Spark SQL, 0.0 and -0.0 are already considered equal in grouping semantics and normalized to 0.0. This PR makes the hash-based array operations consistent with those existing SQL semantics. > > > > A subtle but important point: I used the behavior of `java.util.HashSet` to support the changes I was making, but alignment with `java.util.HashSet` was _not_ the ultimate goal. The goal was to fix a plain correctness bug in the behavior of the percentile aggregate function. You can see a [minimal reproduction of that bug here][bug]. > > > > [bug]: https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17806954 > > > > > Preserving -0.0 here would keep these array set-like operations inconsistent with existing Spark SQL semantics, since equivalent scalar operations (DISTINCT, GROUP BY) already normalize it to 0.0. > > > > For reference, I checked the behavior of PostgreSQL: > > > > ```sql > > postgres=# select 0.0 union select -0.0; > > ?column? > > ---------- > > 0.0 > > (1 row) > > > > postgres=# select distinct unnest(array[0.0, -0.0]); > > unnest > > -------- > > 0.0 > > (1 row) > > ``` > > > > PostgreSQL does not have array functions like `array_{intersect, union, except, distinct}`. Instead, you `unnest()` the array and then use the usual SELECT-level operations. This produces the behavior you are advocating for. > > > > PostgreSQL seems to go further in that it immediately normalizes -0.0, even if there is no aggregation involved: > > > > ```sql > > postgres=# select -0.0; > > ?column? > > ---------- > > 0.0 > > (1 row) > > ``` > > > > If we're going to normalize, I think it would be simpler both from the implementation and user perspectives to just always normalize -0.0 to 0.0 like PostgreSQL. > > > > #23388 implies that it is important that we preserve the difference between 0.0 and -0.0, but it's not clear why beyond that it's the status quo. Maybe that's good enough a justification, but I agree that the inconsistency across array aggregations and general result set aggregations should be resolved. Thanks @nchammas, I think we’re aligned then. This PR follows the same normalization direction as PostgreSQL, scoped to the SQL array set-like operations where equality semantics apply. Regarding #23388: I read it as broadly consistent with this approach. That PR introduced normalization for operators where SQL equality semantics apply (window partition, join, grouping keys via NormalizeFloatingNumbers), rather than globally canonicalizing floating-point values across Spark. This PR follows the same principle, extending it to array set-like operations, which were not covered at the time. Happy to iterate on the implementation if there are specific concerns. Otherwise, would appreciate another look when you have time. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
