Github user adoron commented on a diff in the pull request: https://github.com/apache/spark/pull/23043#discussion_r235683861 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala --- @@ -723,4 +723,18 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext { "grouping expressions: [current_date(None)], value: [key: int, value: string], " + "type: GroupBy]")) } + + test("SPARK-26021: Double and Float 0.0/-0.0 should be equal when grouping") { + val colName = "i" + val doubles = Seq(0.0d, 0.0d, -0.0d).toDF(colName).groupBy(colName).count().collect() --- End diff -- Actually yes, if codegen is enabled a generated FastHashMap is used for the partial grouping before the shuffle. This map doesn't separate 0.0 and -0.0. In addition there are 2 threads for 2 partitions in the unit test. In the doubles Seq order each partition is grouped in 0.0 and so after the shuffle they are being merged. In the floats case the order of the elements in the Seq is different so in the first grouping we get 1 partition on 0.0 and the other on -0.0 and so after the shuffle they are being treated as different groups (before the fix). Now I remember this is the reason I originally disabled codegen in the test. I think I'll just reorder the doubles Seq as well so it manifests the bug.
--- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org