Github user adoron commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23043#discussion_r235683861
  
    --- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
    @@ -723,4 +723,18 @@ class DataFrameAggregateSuite extends QueryTest with 
SharedSQLContext {
           "grouping expressions: [current_date(None)], value: [key: int, 
value: string], " +
             "type: GroupBy]"))
       }
    +
    +  test("SPARK-26021: Double and Float 0.0/-0.0 should be equal when 
grouping") {
    +    val colName = "i"
    +    val doubles = Seq(0.0d, 0.0d, 
-0.0d).toDF(colName).groupBy(colName).count().collect()
    --- End diff --
    
    Actually yes, if codegen is enabled a generated FastHashMap is used for the 
partial grouping before the shuffle. This map doesn't separate 0.0 and -0.0. In 
addition there are 2 threads for 2 partitions in the unit test. In the doubles 
Seq order each partition is grouped in 0.0 and so after the shuffle they are 
being merged.
    In the floats case the order of the elements in the Seq is different so in 
the first grouping we get 1 partition on 0.0 and the other on -0.0 and so after 
the shuffle they are being treated as different groups (before the fix).
    Now I remember this is the reason I originally disabled codegen in the test.
    I think I'll just reorder the doubles Seq as well so it manifests the bug.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to