Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21291#discussion_r188659776

--- Diff: python/pyspark/sql/tests.py ---
@@ -5239,8 +5239,8 @@ def test_complex_groupby(self):
         expected2 = df.groupby().agg(sum(df.v))

         # groupby one column and one sql expression
-        result3 = df.groupby(df.id, df.v % 2).agg(sum_udf(df.v))
-        expected3 = df.groupby(df.id, df.v % 2).agg(sum(df.v))
+        result3 = df.groupby(df.id, df.v % 2).agg(sum_udf(df.v)).orderBy(df.id, df.v % 2)
--- End diff ---

Simply put, the data ordering between `result3` and `expected3` is different now.

The previous query plans for the two queries:

```
== Physical Plan ==
!AggregateInPandas [id#0L, (v#8 % 2.0) AS (v#8 % 2.0)#40], [sum(v#8)], [id#0L, (v#8 % 2.0)#40 AS (v % 2)#22, sum(v)#21 AS sum(v)#23]
+- *(2) Sort [id#0L ASC NULLS FIRST, (v#8 % 2.0) AS (v#8 % 2.0)#40 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#0L, (v#8 % 2.0) AS (v#8 % 2.0)#40, 200)
      +- Generate explode(vs#4), [id#0L], false, [v#8]
         +- *(1) Project [id#0L, array((20.0 + cast(id#0L as double)), (21.0 + cast(id#0L as double)), (22.0 + cast(id#0L as double)), (23.0 + cast(id#0L as double)), (24.0 + cast(id#0L as double)), (25.0 + cast(id#0L as double)), (26.0 + cast(id#0L as double)), (27.0 + cast(id#0L as double)), (28.0 + cast(id#0L as double)), (29.0 + cast(id#0L as double))) AS vs#4]
            +- *(1) Range (0, 10, step=1, splits=8)
```

```
== Physical Plan ==
*(3) HashAggregate(keys=[id#0L, (v#8 % 2.0)#36], functions=[sum(v#8)], output=[id#0L, (v % 2)#31, sum(v)#32])
+- Exchange hashpartitioning(id#0L, (v#8 % 2.0)#36, 200)
   +- *(2) HashAggregate(keys=[id#0L, (v#8 % 2.0) AS (v#8 % 2.0)#36], functions=[partial_sum(v#8)], output=[id#0L, (v#8 % 2.0)#36, sum#38])
      +- Generate explode(vs#4), [id#0L], false, [v#8]
         +- *(1) Project [id#0L, array((20.0 + cast(id#0L as double)), (21.0 + cast(id#0L as double)), (22.0 + cast(id#0L as double)), (23.0 + cast(id#0L as double)), (24.0 + cast(id#0L as double)), (25.0 + cast(id#0L as double)), (26.0 + cast(id#0L as double)), (27.0 + cast(id#0L as double)), (28.0 + cast(id#0L as double)), (29.0 + cast(id#0L as double))) AS vs#4]
            +- *(1) Range (0, 10, step=1, splits=8)
```

Both plans contain an `Exchange hashpartitioning`, so the two queries previously produced the same data distribution. Note that the `Sort` didn't change the data ordering there, because with 200 shuffle partitions the rows are spread so sparsely that each partition has little or nothing to reorder.
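For reference, here is a minimal repro sketch of my own (not the test code itself, though the data mirrors the `Project` node in the plans above); it assumes a local `SparkSession` with pyarrow installed and grouped-agg pandas UDF support:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import (array, col, explode, lit, pandas_udf,
                                   sum, PandasUDFType)

spark = SparkSession.builder.master("local[4]").getOrCreate()

# ids 0..9, each exploded into ten doubles 20.0+id .. 29.0+id,
# matching the Project node shown in the plans above.
df = (spark.range(10).toDF("id")
      .withColumn("vs", array([lit(float(i)) + col("id") for i in range(20, 30)]))
      .withColumn("v", explode(col("vs"))))

# Grouped-aggregate pandas UDF equivalent of the built-in sum.
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def sum_udf(v):
    return v.sum()

result3 = df.groupby(df.id, df.v % 2).agg(sum_udf(df.v))
expected3 = df.groupby(df.id, df.v % 2).agg(sum(df.v))

result3.explain()    # AggregateInPandas + Sort path
expected3.explain()  # two-stage HashAggregate path
```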
The current query plans:

```
== Physical Plan ==
!AggregateInPandas [id#388L, (v#396 % 2.0) AS (v#396 % 2.0)#453], [sum(v#396)], [id#388L, (v#396 % 2.0)#453 AS (v % 2)#438, sum(v)#437 AS sum(v)#439]
+- *(2) Sort [id#388L ASC NULLS FIRST, (v#396 % 2.0) AS (v#396 % 2.0)#453 ASC NULLS FIRST], false, 0
   +- Generate explode(vs#392), [id#388L], false, [v#396]
      +- *(1) Project [id#388L, array((20.0 + cast(id#388L as double)), (21.0 + cast(id#388L as double)), (22.0 + cast(id#388L as double)), (23.0 + cast(id#388L as double)), (24.0 + cast(id#388L as double)), (25.0 + cast(id#388L as double)), (26.0 + cast(id#388L as double)), (27.0 + cast(id#388L as double)), (28.0 + cast(id#388L as double)), (29.0 + cast(id#388L as double))) AS vs#392]
         +- *(1) Range (0, 10, step=1, splits=4)
```

```
== Physical Plan ==
*(2) HashAggregate(keys=[id#388L, (v#396 % 2.0)#454], functions=[sum(v#396)], output=[id#388L, (v % 2)#447, sum(v)#448])
+- *(2) HashAggregate(keys=[id#388L, (v#396 % 2.0) AS (v#396 % 2.0)#454], functions=[partial_sum(v#396)], output=[id#388L, (v#396 % 2.0)#454, sum#456])
   +- Generate explode(vs#392), [id#388L], false, [v#396]
      +- *(1) Project [id#388L, array((20.0 + cast(id#388L as double)), (21.0 + cast(id#388L as double)), (22.0 + cast(id#388L as double)), (23.0 + cast(id#388L as double)), (24.0 + cast(id#388L as double)), (25.0 + cast(id#388L as double)), (26.0 + cast(id#388L as double)), (27.0 + cast(id#388L as double)), (28.0 + cast(id#388L as double)), (29.0 + cast(id#388L as double))) AS vs#392]
         +- *(1) Range (0, 10, step=1, splits=4)
```

The `Exchange` is not there anymore, so the two queries still have the same data distribution, but now the `Sort` does change the data ordering.
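Since the truncated hunk above only shows the new `result3` line, the snippet below is a hedged sketch of my own (continuing the repro sketch earlier; the `sum_v` alias and the pandas comparison are illustrative, not the PR's actual change) of how ordering both sides keeps the assertion stable:

```python
import pandas as pd

# Continuing the earlier sketch (df, sum_udf, and pyspark's `sum` in scope).
# Sorting both sides on the grouping keys makes the row order deterministic,
# whether or not the planner inserts an Exchange.
result3 = (df.groupby(df.id, df.v % 2)
             .agg(sum_udf(df.v).alias("sum_v"))
             .orderBy(df.id, df.v % 2))
expected3 = (df.groupby(df.id, df.v % 2)
               .agg(sum(df.v).alias("sum_v"))
               .orderBy(df.id, df.v % 2))

# With identical ordering on both sides, a row-wise comparison no longer
# depends on the physical plan's data ordering.
pd.testing.assert_frame_equal(expected3.toPandas(), result3.toPandas())
```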