prakharjain09 opened a new pull request #30302:
URL: https://github.com/apache/spark/pull/30302
### Adding more test cases.
### What changes were proposed in this pull request?
This pull request removes unneeded sorts when a Project contains aliases and
its child node already reports an output ordering. In such cases the Project
should propagate the outputOrdering information by normalizing the ordering
coming from its child against those aliases.
Example: consider this join of three tables:
"""
|SELECT t2id, t3.id as t3id
|FROM (
| SELECT t1.id as t1id, t2.id as t2id
| FROM t1, t2
| WHERE t1.id = t2.id
|) t12, t3
|WHERE t1id = t3.id
""".
The plan for this looks like:
*(8) Project [t2id#1059L, id#1004L AS t3id#1060L]
+- *(8) SortMergeJoin [t2id#1059L], [id#1004L], Inner
   :- *(5) Sort [t2id#1059L ASC NULLS FIRST ], false, 0     <-----------------------------
   :  +- *(5) Project [id#1000L AS t2id#1059L]
   :     +- *(5) SortMergeJoin [id#996L], [id#1000L], Inner
   :        :- *(2) Sort [id#996L ASC NULLS FIRST ], false, 0
   :        :  +- Exchange hashpartitioning(id#996L, 5), true, [id=#1426]
   :        :     +- *(1) Range (0, 10, step=1, splits=2)
   :        +- *(4) Sort [id#1000L ASC NULLS FIRST ], false, 0
   :           +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1432]
   :              +- *(3) Range (0, 20, step=1, splits=2)
   +- *(7) Sort [id#1004L ASC NULLS FIRST ], false, 0
      +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1443]
         +- *(6) Range (0, 30, step=1, splits=2)
In this plan, the marked Sort node could have been avoided because the data is
already sorted on "t2.id" by the lower SortMergeJoin. The extra Sort is added
because the AliasAwareOutputOrdering class handles normalization only for
certain specific cases. This change normalizes all SortOrder expressions by
traversing the expression tree and applying the aliases defined in the Project.
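
To make the idea concrete, here is a minimal toy model in Python. It is not
Spark code: the `Attr`, `Func`, `SortOrder`, `normalize`, `normalize_ordering`
and `sort_needed` names are hypothetical and exist only for this sketch. It
assumes the child reports an ordering on id#1000L and that a Sort is needed
whenever the required ordering is not a prefix of the reported ordering.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass(frozen=True)
class Attr:
    """A column reference such as id#1000L (hypothetical toy type)."""
    name: str


@dataclass(frozen=True)
class Func:
    """A generic expression node with child expressions (hypothetical toy type)."""
    name: str
    children: Tuple["Expr", ...]


Expr = Union[Attr, Func]


@dataclass(frozen=True)
class SortOrder:
    """One sort key plus its direction (hypothetical toy type)."""
    expr: Expr
    ascending: bool = True


def normalize(expr: Expr, aliases: Dict[Expr, Attr]) -> Expr:
    """Replace every aliased sub-expression with its alias, walking the whole tree."""
    if expr in aliases:
        return aliases[expr]
    if isinstance(expr, Func):
        return Func(expr.name, tuple(normalize(c, aliases) for c in expr.children))
    return expr


def normalize_ordering(ordering: List[SortOrder],
                       aliases: Dict[Expr, Attr]) -> List[SortOrder]:
    """Normalize each sort key of the child's ordering against the Project's aliases."""
    return [SortOrder(normalize(o.expr, aliases), o.ascending) for o in ordering]


def sort_needed(required: List[SortOrder], provided: List[SortOrder]) -> bool:
    """A Sort is needed unless `required` is a prefix of `provided`."""
    return provided[:len(required)] != required


# Project [id#1000L AS t2id#1059L] from the plan above.
aliases = {Attr("id#1000L"): Attr("t2id#1059L")}
# Assumption: the lower SortMergeJoin reports an ordering on id#1000L.
child_ordering = [SortOrder(Attr("id#1000L"))]
# The upper SortMergeJoin requires its left side sorted on t2id#1059L.
required = [SortOrder(Attr("t2id#1059L"))]

print(sort_needed(required, child_ordering))                               # True
print(sort_needed(required, normalize_ordering(child_ordering, aliases)))  # False
```

Without normalization the required key t2id#1059L does not match the child's
id#1000L, so a Sort is planned. Once the ordering is rewritten through the
alias map, the two match and the Sort becomes unnecessary. Because `normalize`
recurses into `Func` children, an aliased attribute nested inside a larger
sort expression is rewritten as well.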
### Why are the changes needed?
To remove unneeded Sort operators.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
New unit tests added.