Github user viirya commented on a diff in the pull request:
https://github.com/apache/spark/pull/21291#discussion_r188659776
--- Diff: python/pyspark/sql/tests.py ---
@@ -5239,8 +5239,8 @@ def test_complex_groupby(self):
expected2 = df.groupby().agg(sum(df.v))
# groupby one column and one sql expression
- result3 = df.groupby(df.id, df.v % 2).agg(sum_udf(df.v))
- expected3 = df.groupby(df.id, df.v % 2).agg(sum(df.v))
+ result3 = df.groupby(df.id, df.v % 2).agg(sum_udf(df.v)).orderBy(df.id, df.v % 2)
--- End diff --
Simply put, the data ordering between `result3` and `expected3` is
different now.
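To see why that matters for the test: comparing the collected rows of the two DataFrames is order-sensitive, so the assertion fails even when the aggregated values agree. A minimal pure-Python sketch of the problem (hypothetical rows, not Spark's actual output):

```python
# Hypothetical aggregated rows (id, v % 2, sum(v)); same data, different order.
result3_rows = [(1, 0.0, 44.0), (0, 0.0, 42.0), (1, 1.0, 46.0)]
expected3_rows = [(0, 0.0, 42.0), (1, 0.0, 44.0), (1, 1.0, 46.0)]

# An order-sensitive comparison (what comparing collected row lists
# amounts to) fails even though the values agree.
order_sensitive_equal = result3_rows == expected3_rows

# Ordering both sides by the grouping keys first -- the effect of the
# added .orderBy(df.id, df.v % 2) -- makes the comparison deterministic.
key = lambda row: (row[0], row[1])
order_insensitive_equal = (sorted(result3_rows, key=key)
                           == sorted(expected3_rows, key=key))
```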
Previous query plans for the two queries:
```
== Physical Plan ==
!AggregateInPandas [id#0L, (v#8 % 2.0) AS (v#8 % 2.0)#40], [sum(v#8)], [id#0L, (v#8 % 2.0)#40 AS (v % 2)#22, sum(v)#21 AS sum(v)#23]
+- *(2) Sort [id#0L ASC NULLS FIRST, (v#8 % 2.0) AS (v#8 % 2.0)#40 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#0L, (v#8 % 2.0) AS (v#8 % 2.0)#40, 200)
      +- Generate explode(vs#4), [id#0L], false, [v#8]
         +- *(1) Project [id#0L, array((20.0 + cast(id#0L as double)), (21.0 + cast(id#0L as double)), (22.0 + cast(id#0L as double)), (23.0 + cast(id#0L as double)), (24.0 + cast(id#0L as double)), (25.0 + cast(id#0L as double)), (26.0 + cast(id#0L as double)), (27.0 + cast(id#0L as double)), (28.0 + cast(id#0L as double)), (29.0 + cast(id#0L as double))) AS vs#4]
            +- *(1) Range (0, 10, step=1, splits=8)
```
```
== Physical Plan ==
*(3) HashAggregate(keys=[id#0L, (v#8 % 2.0)#36], functions=[sum(v#8)], output=[id#0L, (v % 2)#31, sum(v)#32])
+- Exchange hashpartitioning(id#0L, (v#8 % 2.0)#36, 200)
   +- *(2) HashAggregate(keys=[id#0L, (v#8 % 2.0) AS (v#8 % 2.0)#36], functions=[partial_sum(v#8)], output=[id#0L, (v#8 % 2.0)#36, sum#38])
      +- Generate explode(vs#4), [id#0L], false, [v#8]
         +- *(1) Project [id#0L, array((20.0 + cast(id#0L as double)), (21.0 + cast(id#0L as double)), (22.0 + cast(id#0L as double)), (23.0 + cast(id#0L as double)), (24.0 + cast(id#0L as double)), (25.0 + cast(id#0L as double)), (26.0 + cast(id#0L as double)), (27.0 + cast(id#0L as double)), (28.0 + cast(id#0L as double)), (29.0 + cast(id#0L as double))) AS vs#4]
            +- *(1) Range (0, 10, step=1, splits=8)
```
Both plans have an `Exchange hashpartitioning`, which previously produced
the same data distribution for the two queries. Notice that the `Sort`
doesn't change the data ordering, because hashing into 200 partitions
makes the distribution sparse.
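The sparsity argument can be sketched without Spark: the test data has only 20 distinct grouping keys (10 ids × 2 parities of `v`), so hashing them into 200 partitions leaves most partitions with at most one group, and a per-partition sort has little left to reorder. A rough pure-Python analogue (Python's built-in `hash()` stands in for Spark's Murmur3-based partitioner, so exact bucket assignments differ):

```python
from collections import defaultdict

NUM_PARTITIONS = 200
# 10 ids x 2 parities of v => 20 distinct grouping keys, as in the test data.
keys = [(i, float(p)) for i in range(10) for p in (0, 1)]

# Stand-in for Exchange hashpartitioning: route each key to a bucket.
partitions = defaultdict(list)
for k in keys:
    partitions[hash(k) % NUM_PARTITIONS].append(k)

# With 20 keys spread over 200 buckets, occupancy is sparse: at most 20
# partitions are used, so a per-partition Sort barely changes anything.
occupied = len(partitions)
total = sum(len(ks) for ks in partitions.values())
```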
Current query plans:
```
== Physical Plan ==
!AggregateInPandas [id#388L, (v#396 % 2.0) AS (v#396 % 2.0)#453], [sum(v#396)], [id#388L, (v#396 % 2.0)#453 AS (v % 2)#438, sum(v)#437 AS sum(v)#439]
+- *(2) Sort [id#388L ASC NULLS FIRST, (v#396 % 2.0) AS (v#396 % 2.0)#453 ASC NULLS FIRST], false, 0
   +- Generate explode(vs#392), [id#388L], false, [v#396]
      +- *(1) Project [id#388L, array((20.0 + cast(id#388L as double)), (21.0 + cast(id#388L as double)), (22.0 + cast(id#388L as double)), (23.0 + cast(id#388L as double)), (24.0 + cast(id#388L as double)), (25.0 + cast(id#388L as double)), (26.0 + cast(id#388L as double)), (27.0 + cast(id#388L as double)), (28.0 + cast(id#388L as double)), (29.0 + cast(id#388L as double))) AS vs#392]
         +- *(1) Range (0, 10, step=1, splits=4)
```
```
== Physical Plan ==
*(2) HashAggregate(keys=[id#388L, (v#396 % 2.0)#454], functions=[sum(v#396)], output=[id#388L, (v % 2)#447, sum(v)#448])
+- *(2) HashAggregate(keys=[id#388L, (v#396 % 2.0) AS (v#396 % 2.0)#454], functions=[partial_sum(v#396)], output=[id#388L, (v#396 % 2.0)#454, sum#456])
   +- Generate explode(vs#392), [id#388L], false, [v#396]
      +- *(1) Project [id#388L, array((20.0 + cast(id#388L as double)), (21.0 + cast(id#388L as double)), (22.0 + cast(id#388L as double)), (23.0 + cast(id#388L as double)), (24.0 + cast(id#388L as double)), (25.0 + cast(id#388L as double)), (26.0 + cast(id#388L as double)), (27.0 + cast(id#388L as double)), (28.0 + cast(id#388L as double)), (29.0 + cast(id#388L as double))) AS vs#392]
         +- *(1) Range (0, 10, step=1, splits=4)
```
The `Exchange` is not there anymore. The two queries still have the same
data distribution, but now the `Sort` changes the data ordering.
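A toy sketch of that effect, under simplifying assumptions (one input partition, made-up `(id, v)` values): the `result3`-like plan sorts on the grouping key before aggregating, while the `expected3`-like `HashAggregate` emits groups in encounter order, so the two outputs hold the same groups in different row orders.

```python
# One input partition of (id, v) rows; with no Exchange, both plans see
# the same partitioned data.
partition = [(1, 3.0), (0, 2.0), (1, 1.0)]

def aggregate(rows, key_order):
    # Sum v per id, then emit groups in whatever order key_order dictates.
    sums = {}
    for k, v in rows:
        sums[k] = sums.get(k, 0.0) + v
    return [(k, sums[k]) for k in key_order(sums)]

# result3-like path: a Sort precedes AggregateInPandas, so groups come
# out in key order.
sorted_plan = aggregate(partition, sorted)
# expected3-like path: HashAggregate emits groups in encounter order
# (Python dicts preserve insertion order, standing in for the hash map).
hash_plan = aggregate(partition, list)
```

Both results contain the same group sums; only the row order differs, which is exactly what trips an order-sensitive test comparison.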
---