cloud-fan commented on a change in pull request #24049: [SPARK-27123][SQL]
Improve CollapseProject to handle projects cross limit/repartition/sample
URL: https://github.com/apache/spark/pull/24049#discussion_r264988470
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -699,6 +699,24 @@ object CollapseProject extends Rule[LogicalPlan] {
agg.copy(aggregateExpressions = buildCleanedProjectList(
p.projectList, agg.aggregateExpressions))
}
+ case p1 @ Project(_, g @ GlobalLimit(_, l @ LocalLimit(_, p2: Project))) =>
Review comment:
Sorry to be late for the review. I have two concerns about this optimization:
1. If `p2` outputs one column and `p1` outputs 1000 columns, pushing `p1` down
through the limit operator increases the amount of data to be shuffled.
2. If `p1` contains an expensive expression such as a UDF, pushing `p1` through
the limit operator means the expensive expression will be evaluated many more
times (see the sketch below for a query where both concerns apply).
Do we have a general rule to justify the benefit of pushing down the project
operator?
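
To make the two cases concrete, here is a minimal sketch of the kind of query
where both concerns apply. It assumes a local `SparkSession` and a hypothetical
`expensiveUdf`; it is not taken from this PR's tests, just an illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object CollapseAcrossLimitSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("collapse-sketch").getOrCreate()

    // Hypothetical stand-in for "an expensive expression like UDF": costly per row.
    val expensiveUdf = udf((s: String) => { Thread.sleep(1); s.toUpperCase })

    // p2: the inner Project, producing a single narrow column.
    val p2 = spark.range(1000000L).selectExpr("CAST(id AS STRING) AS s")

    // GlobalLimit(LocalLimit(p2, 10), 10) in the logical plan.
    val limited = p2.limit(10)

    // p1: the outer Project applying the expensive expression
    // (or imagine it producing 1000 derived columns instead).
    val p1 = limited.select(expensiveUdf(col("s")).as("u"))

    // If CollapseProject merged p1 into p2 across the limit, the UDF would run on
    // up to 10 rows in every partition feeding the shuffle that GlobalLimit needs,
    // instead of only on the 10 rows that survive the limit; a wide p1 would
    // likewise widen every row shuffled between LocalLimit and GlobalLimit.
    p1.explain(true)

    spark.stop()
  }
}
```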