cloud-fan commented on a change in pull request #23303: [SPARK-26352][SQL] join
reorder should not change the order of output attributes
URL: https://github.com/apache/spark/pull/23303#discussion_r241965281
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -403,10 +404,54 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
/**
* Remove projections from the query plan that do not make any modifications.
+ * It handles top-level and intermediate [[Project]]s differently:
+ * - Top-level:
+ * A [[Project]] is only considered redundant if its output attributes are exactly the same as
+ * its child's, including the order of attributes.
+ * This affects how the outside world perceives this query plan.
+ * - Intermediate (not top-level):
+ * A [[Project]] is redundant as long as its outputSet is the same as the child's. It won't
+ * affect the outer appearance so we're free to change the order of the output attributes.
+ * We should, however, retain the [[Project]]s that have a shorter output attribute list than
+ * the child's. That can reduce the materialized data size so it's worth keeping.
*/
object RemoveRedundantProject extends Rule[LogicalPlan] {
Review comment:
This is too risky. Are there other ways to work around it? Or can we accept
sub-optimal plans?
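For reference, the two redundancy checks described in the doc comment can be
sketched with a toy model. This is not Spark's implementation: `Plan`,
`Relation`, `Project`, and both predicates are hypothetical names, and
attributes are modeled as plain strings rather than Catalyst attributes.

```scala
// Toy model only (assumption): an attribute is just its name.
sealed trait Plan { def output: Seq[String] }

final case class Relation(output: Seq[String]) extends Plan

final case class Project(projectList: Seq[String], child: Plan) extends Plan {
  def output: Seq[String] = projectList
}

object RedundantProjectSketch {
  // Top-level: redundant only if the output matches the child's exactly,
  // including the order of attributes.
  def redundantAtTop(p: Project): Boolean =
    p.output == p.child.output

  // Intermediate: redundant if the set of attributes matches, in any order.
  // A Project with a shorter output list than its child is kept, since it
  // may reduce the amount of materialized data.
  def redundantInside(p: Project): Boolean =
    p.output.toSet == p.child.output.toSet &&
      p.output.size >= p.child.output.size
}
```

Under this sketch, a Project that only reorders `a, b, c` into `b, a, c` is
kept at the top level but dropped when it is intermediate, which is the
behavior change the comment above is questioning.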