wangyum commented on a change in pull request #35214:
URL: https://github.com/apache/spark/pull/35214#discussion_r785770644
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -766,18 +767,24 @@ object PushProjectionThroughUnion extends Rule[LogicalPlan] with PredicateHelper
_.containsAllPatterns(UNION, PROJECT)) {
// Push down deterministic projection through UNION ALL
- case p @ Project(projectList, u: Union) =>
+ case Project(projectList, u: Union) if projectList.forall(_.deterministic) =>
assert(u.children.nonEmpty)
- if (projectList.forall(_.deterministic)) {
- val newFirstChild = Project(projectList, u.children.head)
- val newOtherChildren = u.children.tail.map { child =>
- val rewrites = buildRewrites(u.children.head, child)
- Project(projectList.map(pushToRight(_, rewrites)), child)
- }
- u.copy(children = newFirstChild +: newOtherChildren)
- } else {
- p
+ val newFirstChild = Project(projectList, u.children.head)
+ val newOtherChildren = u.children.tail.map { child =>
+ val rewrites = buildRewrites(u.children.head, child)
+ Project(projectList.map(pushToRight(_, rewrites)), child)
+ }
+ u.copy(children = newFirstChild +: newOtherChildren)
+
+ // Push down deterministic projection through SQL UNION
+ case Project(projectList, Distinct(u: Union)) if projectList.forall(_.deterministic) =>
Review comment:
Before this PR, a `Project` separates each nested `Distinct` from the `Union`
above it, so `CombineUnions` cannot combine them. Later,
`ReplaceDistinctWithAggregate` replaces each `Distinct` with an `Aggregate`.
```
Distinct
+- Union false, false
   :- Project [cast(id#34 as decimal(22,5)) AS id#36]
   :  +- Distinct
   :     +- Union false, false
   :        :- Project [cast(id#32 as decimal(21,4)) AS id#34]
   :        :  +- Distinct
   :        :     +- Union false, false
   :        :        :- Project [cast(id#30 as decimal(20,3)) AS id#32]
   :        :        :  +- Distinct
   :        :        :     +- Union false, false
   :        :        :        :- Project [cast(id#25 as decimal(19,2)) AS id#30]
   :        :        :        :  +- Relation default.t1[id#25] parquet
   :        :        :        +- Project [cast(id#26 as decimal(19,2)) AS id#31]
   :        :        :           +- Relation default.t2[id#26] parquet
   :        :        +- Project [cast(id#27 as decimal(20,3)) AS id#33]
   :        :           +- Relation default.t3[id#27] parquet
   :        +- Project [cast(id#28 as decimal(21,4)) AS id#35]
   :           +- Relation default.t4[id#28] parquet
   +- Project [cast(id#29 as decimal(22,5)) AS id#37]
      +- Relation default.t5[id#29] parquet
```
After this PR, we first push the projections through the unions, and then
`CombineUnions` can combine them.
```
=== Result of Batch PushProjectionThroughUnion ===
Before:
Distinct
+- Union false, false
   :- Project [cast(id#34 as decimal(22,5)) AS id#36]
   :  +- Distinct
   :     +- Union false, false
   :        :- Project [cast(id#32 as decimal(21,4)) AS id#34]
   :        :  +- Distinct
   :        :     +- Union false, false
   :        :        :- Project [cast(id#30 as decimal(20,3)) AS id#32]
   :        :        :  +- Distinct
   :        :        :     +- Union false, false
   :        :        :        :- Project [cast(id#25 as decimal(19,2)) AS id#30]
   :        :        :        :  +- Project [id#25]
   :        :        :        :     +- Relation default.t1[id#25] parquet
   :        :        :        +- Project [cast(id#26 as decimal(19,2)) AS id#31]
   :        :        :           +- Project [id#26]
   :        :        :              +- Relation default.t2[id#26] parquet
   :        :        +- Project [cast(id#27 as decimal(20,3)) AS id#33]
   :        :           +- Project [id#27]
   :        :              +- Relation default.t3[id#27] parquet
   :        +- Project [cast(id#28 as decimal(21,4)) AS id#35]
   :           +- Project [id#28]
   :              +- Relation default.t4[id#28] parquet
   +- Project [cast(id#29 as decimal(22,5)) AS id#37]
      +- Project [id#29]
         +- Relation default.t5[id#29] parquet

After:
Distinct
+- Union false, false
   :- Distinct
   :  +- Union false, false
   :     :- Distinct
   :     :  +- Union false, false
   :     :     :- Distinct
   :     :     :  +- Union false, false
   :     :     :     :- Project [cast(id#34 as decimal(22,5)) AS id#36]
   :     :     :     :  +- Project [cast(id#32 as decimal(21,4)) AS id#34]
   :     :     :     :     +- Project [cast(id#30 as decimal(20,3)) AS id#32]
   :     :     :     :        +- Project [cast(id#25 as decimal(19,2)) AS id#30]
   :     :     :     :           +- Project [id#25]
   :     :     :     :              +- Relation default.t1[id#25] parquet
   :     :     :     +- Project [cast(id#48 as decimal(22,5)) AS id#49]
   :     :     :        +- Project [cast(id#46 as decimal(21,4)) AS id#48]
   :     :     :           +- Project [cast(id#31 as decimal(20,3)) AS id#46]
   :     :     :              +- Project [cast(id#26 as decimal(19,2)) AS id#31]
   :     :     :                 +- Project [id#26]
   :     :     :                    +- Relation default.t2[id#26] parquet
   :     :     +- Project [cast(id#45 as decimal(22,5)) AS id#47]
   :     :        +- Project [cast(id#33 as decimal(21,4)) AS id#45]
   :     :           +- Project [cast(id#27 as decimal(20,3)) AS id#33]
   :     :              +- Project [id#27]
   :     :                 +- Relation default.t3[id#27] parquet
   :     +- Project [cast(id#35 as decimal(22,5)) AS id#44]
   :        +- Project [cast(id#28 as decimal(21,4)) AS id#35]
   :           +- Project [id#28]
   :              +- Relation default.t4[id#28] parquet
   +- Project [cast(id#29 as decimal(22,5)) AS id#37]
      +- Project [id#29]
         +- Relation default.t5[id#29] parquet
```
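
The interaction described above can be sketched with a toy plan model. This is only an illustration under stated assumptions: `ToyPlans`, its case classes, and the functions `pushProject`, `combineUnions` and `fixedPoint` are hypothetical stand-ins, not Catalyst's `LogicalPlan` API or the actual rule implementations. Expressions are plain strings, attribute rewriting (`buildRewrites`/`pushToRight`) is omitted, and the toy rule does not check determinism or whether the rewrite preserves semantics; it only mirrors the shape of the new `Project(projectList, Distinct(u: Union))` case.

```scala
object ToyPlans {
  // Hypothetical, simplified plan nodes (not Catalyst classes).
  sealed trait Plan
  final case class Relation(name: String) extends Plan
  final case class Project(exprs: List[String], child: Plan) extends Plan
  final case class Union(children: List[Plan]) extends Plan
  final case class Distinct(child: Plan) extends Plan

  // One top-down pass: move a Project sitting on Distinct(Union(...)) onto each union child.
  def pushProject(plan: Plan): Plan = plan match {
    case Project(exprs, Distinct(Union(children))) =>
      Distinct(Union(children.map(c => pushProject(Project(exprs, c)))))
    case Project(exprs, child) => Project(exprs, pushProject(child))
    case Union(children)       => Union(children.map(pushProject))
    case Distinct(child)       => Distinct(pushProject(child))
    case r: Relation           => r
  }

  // Bottom-up: flatten Distinct(Union(...)) children that sit directly under a distinct union.
  def combineUnions(plan: Plan): Plan = plan match {
    case Distinct(Union(children)) =>
      val flat = children.map(combineUnions).flatMap {
        case Distinct(Union(grandChildren)) => grandChildren
        case other                          => List(other)
      }
      Distinct(Union(flat))
    case Union(children)       => Union(children.map(combineUnions))
    case Distinct(child)       => Distinct(combineUnions(child))
    case Project(exprs, child) => Project(exprs, combineUnions(child))
    case r: Relation           => r
  }

  // Re-apply a rule until the plan stops changing, like a fixed-point optimizer batch.
  @annotation.tailrec
  def fixedPoint(plan: Plan)(rule: Plan => Plan): Plan = {
    val next = rule(plan)
    if (next == plan) plan else fixedPoint(next)(rule)
  }

  val (t1, t2, t3) = (Relation("t1"), Relation("t2"), Relation("t3"))

  // Shape of the "before" plan: a Project sits between the outer Union and the nested Distinct.
  val before: Plan = Distinct(Union(List(
    Project(List("cast2"), Distinct(Union(List(
      Project(List("cast1"), t1),
      Project(List("cast1"), t2))))),
    Project(List("cast2"), t3))))

  // Push the projections down first, then combine the now-adjacent distinct unions.
  val pushed: Plan = fixedPoint(before)(pushProject)
  val combined: Plan = combineUnions(pushed)
}
```

In this toy, `combineUnions(before)` returns `before` unchanged because the `Project` hides the nested `Distinct(Union(...))`, whereas `combined` is a single `Distinct` over one three-way `Union` whose children are the stacked projections over `t1`, `t2` and `t3`.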