nchammas commented on PR #53695:
URL: https://github.com/apache/spark/pull/53695#issuecomment-4510971006

   Thanks for taking a look at the linked issue.
   
   > As @cloud-fan noted in the discussion of #45036, what matters for these 
operators is the SQL semantic rather than alignment with java.util.HashSet. In 
Spark SQL, 0.0 and -0.0 are already considered equal in grouping semantics and 
normalized to 0.0. This PR makes the hash-based array operations consistent 
with those existing SQL semantics.
   
   A subtle but important point: I used the behavior of `java.util.HashSet` to 
support the changes I was making, but alignment with `java.util.HashSet` was 
_not_ the ultimate goal. The goal was to fix a plain correctness bug in the 
behavior of the percentile aggregate function. You can see a [minimal 
reproduction of that bug here][bug].
   
   [bug]: 
https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17806954
   
   > Preserving -0.0 here would keep these array set-like operations 
inconsistent with existing Spark SQL semantics, since equivalent scalar 
operations (DISTINCT, GROUP BY) already normalize it to 0.0.
   
   For reference, I checked the behavior of PostgreSQL:
   
   ```sql
   postgres=# select 0.0 union select -0.0;
    ?column? 
   ----------
         0.0
   (1 row)
   
   postgres=# select distinct unnest(array[0.0, -0.0]);                     
    unnest 
   --------
       0.0
   (1 row)
   ```
   
   PostgreSQL does not have array functions like `array_{intersect, union, 
except, distinct}`. Instead, you `unnest()` the array and then use the usual 
SELECT-level operations. This produces the behavior you are advocating for.
   
   PostgreSQL seems to go further in that it immediately normalizes -0.0, even 
if there is no aggregation involved:
   
   ```sql
   postgres=# select -0.0;
    ?column? 
   ----------
         0.0
   (1 row)
   ```
   
   If we're going to normalize, I think it would be simpler both from the 
implementation and user perspectives to just always normalize -0.0 to 0.0 like 
PostgreSQL.
   
   #23388 implies that it is important that we preserve the difference between 
0.0 and -0.0, but it's not clear why beyond that it's the status quo. Maybe 
that's good enough a justification.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to