zhengruifeng commented on PR #36908:
URL: https://github.com/apache/spark/pull/36908#issuecomment-1159447554
an example in _Pandas-API-on-Spark_ , the last shuffle can be optimized out.
```python
ps.set_option('compute.default_index_type', 'distributed')
df1 = ps.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'], 'value': [1, 2, 3,
5]}, columns=['lkey', 'value'])
df2 = ps.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'], 'value': [5, 6, 7,
8]}, columns=['rkey', 'value'])
merged = df1.merge(df2, left_on='lkey', right_on='rkey', how='outer')
merged.sort_index().to_json()
```
before this PR:
```
== Physical Plan ==
AdaptiveSparkPlan (31)
+- == Final Plan ==
* Project (20)
+- * Sort (19)
+- AQEShuffleRead (18)
+- ShuffleQueryStage (17), Statistics(sizeInBytes=432.0 B,
rowCount=6)
+- Exchange (16)
+- * Project (15)
+- * Project (14)
+- * SortMergeJoin FullOuter (13)
:- * Sort (6)
: +- AQEShuffleRead (5)
: +- ShuffleQueryStage (4),
Statistics(sizeInBytes=128.0 B, rowCount=4)
: +- Exchange (3)
: +- * Project (2)
: +- * Scan ExistingRDD (1)
+- * Sort (12)
+- AQEShuffleRead (11)
+- ShuffleQueryStage (10),
Statistics(sizeInBytes=128.0 B, rowCount=4)
+- Exchange (9)
+- * Project (8)
+- * Scan ExistingRDD (7)
+- == Initial Plan ==
Project (30)
+- Sort (29)
+- Exchange (28)
+- Project (27)
+- Project (26)
+- SortMergeJoin FullOuter (25)
:- Sort (22)
: +- Exchange (21)
: +- Project (2)
: +- Scan ExistingRDD (1)
+- Sort (24)
+- Exchange (23)
+- Project (8)
+- Scan ExistingRDD (7)
```
after this PR:
```
== Physical Plan ==
AdaptiveSparkPlan (25)
+- == Final Plan ==
* Project (16)
+- * Project (15)
+- * Project (14)
+- * SortMergeJoin FullOuter (13)
:- * Sort (6)
: +- AQEShuffleRead (5)
: +- ShuffleQueryStage (4), Statistics(sizeInBytes=128.0 B,
rowCount=4)
: +- Exchange (3)
: +- * Project (2)
: +- * Scan ExistingRDD (1)
+- * Sort (12)
+- AQEShuffleRead (11)
+- ShuffleQueryStage (10), Statistics(sizeInBytes=128.0 B,
rowCount=4)
+- Exchange (9)
+- * Project (8)
+- * Scan ExistingRDD (7)
+- == Initial Plan ==
Project (24)
+- Project (23)
+- Project (22)
+- SortMergeJoin FullOuter (21)
:- Sort (18)
: +- Exchange (17)
: +- Project (2)
: +- Scan ExistingRDD (1)
+- Sort (20)
+- Exchange (19)
+- Project (8)
+- Scan ExistingRDD (7)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]