asugranyes commented on PR #53695:
URL: https://github.com/apache/spark/pull/53695#issuecomment-4511278234

   > Thanks for taking a look at the linked issue.
   > 
   > 
   > 
   > > As @cloud-fan noted in the discussion of #45036, what matters for these 
operators is the SQL semantic rather than alignment with java.util.HashSet. In 
Spark SQL, 0.0 and -0.0 are already considered equal in grouping semantics and 
normalized to 0.0. This PR makes the hash-based array operations consistent 
with those existing SQL semantics.
   > 
   > 
   > 
   > A subtle but important point: I used the behavior of `java.util.HashSet` 
to support the changes I was making, but alignment with `java.util.HashSet` was 
_not_ the ultimate goal. The goal was to fix a plain correctness bug in the 
behavior of the percentile aggregate function. You can see a [minimal 
reproduction of that bug here][bug].
   > 
   > 
   > 
   > [bug]: 
https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17806954
   > 
   > 
   > 
   > > Preserving -0.0 here would keep these array set-like operations 
inconsistent with existing Spark SQL semantics, since equivalent scalar 
operations (DISTINCT, GROUP BY) already normalize it to 0.0.
   > 
   > 
   > 
   > For reference, I checked the behavior of PostgreSQL:
   > 
   > 
   > 
   > ```sql
   > 
   > postgres=# select 0.0 union select -0.0;
   > 
   >  ?column? 
   > 
   > ----------
   > 
   >       0.0
   > 
   > (1 row)
   > 
   > 
   > 
   > postgres=# select distinct unnest(array[0.0, -0.0]);                     
   > 
   >  unnest 
   > 
   > --------
   > 
   >     0.0
   > 
   > (1 row)
   > 
   > ```
   > 
   > 
   > 
   > PostgreSQL does not have array functions like `array_{intersect, union, 
except, distinct}`. Instead, you `unnest()` the array and then use the usual 
SELECT-level operations. This produces the behavior you are advocating for.
   > 
   > 
   > 
   > PostgreSQL seems to go further in that it immediately normalizes -0.0, 
even if there is no aggregation involved:
   > 
   > 
   > 
   > ```sql
   > 
   > postgres=# select -0.0;
   > 
   >  ?column? 
   > 
   > ----------
   > 
   >       0.0
   > 
   > (1 row)
   > 
   > ```
   > 
   > 
   > 
   > If we're going to normalize, I think it would be simpler both from the 
implementation and user perspectives to just always normalize -0.0 to 0.0 like 
PostgreSQL.
   > 
   > 
   > 
   > #23388 implies that it is important that we preserve the difference 
between 0.0 and -0.0, but it's not clear why beyond that it's the status quo. 
Maybe that's good enough a justification, but I agree that the inconsistency 
across array aggregations and general result set aggregations should be 
resolved.
   
   Thanks @nchammas, I think we’re aligned then.
   
   This PR follows the same normalization direction as PostgreSQL, scoped to 
the SQL array set-like operations where equality semantics apply.
   
   Regarding #23388: I read it as broadly consistent with this approach. That 
PR introduced normalization for operators where SQL equality semantics apply 
(window partition, join, grouping keys via NormalizeFloatingNumbers), rather 
than globally canonicalizing floating-point values across Spark.
   
   This PR follows the same principle, extending it to array set-like 
operations, which were not covered at the time.
   
   Happy to iterate on the implementation if there are specific concerns. 
Otherwise, would appreciate another look when you have time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to