cloud-fan commented on a change in pull request #23556: [SPARK-26626][SQL]
Maximum size for repeatedly substituted aliases in SQL expressions
URL: https://github.com/apache/spark/pull/23556#discussion_r249641907
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##########
@@ -686,6 +687,28 @@ object CollapseProject extends Rule[LogicalPlan] {
}.exists(!_.deterministic))
}
+ private def hasOversizedRepeatedAliases(
+ upper: Seq[NamedExpression], lower: Seq[NamedExpression]): Boolean = {
+ val aliases = collectAliases(lower)
+
+ // Count how many times each alias is used in the upper Project.
+ // If an alias is only used once, we can safely substitute it without increasing the overall
+ // tree size
+ val referenceCounts = AttributeMap(
+ upper
+ .flatMap(_.collect { case a: Attribute => a })
+ .groupBy(identity)
+ .mapValues(_.size).toSeq
+ )
+
+ // Check for any aliases that are used more than once, and are larger than the configured
+ // maximum size
+ aliases.exists({ case (attribute, expression) =>
+ referenceCounts.getOrElse(attribute, 0) > 1 &&
+ expression.treeSize > SQLConf.get.maxRepeatedAliasSize
Review comment:
So your fix only cares about the memory usage of the expressions, not their
execution time?
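
For context on the question, here is a minimal standalone sketch of what the check in the diff guards against. It does not use Spark's classes: `Expr`, `Ref`, `Lit`, `Add`, and the `AliasGrowth` object are hypothetical stand-ins, and `maxSize` plays the role of `SQLConf.get.maxRepeatedAliasSize`. It shows only that substituting an alias referenced more than once copies its subtree and so grows the tree. It says nothing about execution time, which would depend on how the substituted expression is evaluated.

```scala
// Hypothetical miniature expression tree; these are NOT Spark's Expression classes.
sealed trait Expr {
  def children: Seq[Expr]
  // Number of nodes in the tree, analogous to the treeSize used in the diff.
  def treeSize: Int = 1 + children.map(_.treeSize).sum
}
case class Ref(name: String) extends Expr { val children: Seq[Expr] = Nil }
case class Lit(value: Int) extends Expr { val children: Seq[Expr] = Nil }
case class Add(left: Expr, right: Expr) extends Expr {
  val children: Seq[Expr] = Seq(left, right)
}

object AliasGrowth {
  // Replace every reference to an alias with the expression it stands for.
  def substitute(e: Expr, aliases: Map[String, Expr]): Expr = e match {
    case Ref(n)    => aliases.getOrElse(n, e)
    case Add(l, r) => Add(substitute(l, aliases), substitute(r, aliases))
    case other     => other
  }

  // Count how many times each name is referenced across the upper expressions.
  def referenceCounts(upper: Seq[Expr]): Map[String, Int] = {
    def refs(e: Expr): Seq[String] = e match {
      case Ref(n) => Seq(n)
      case other  => other.children.flatMap(refs)
    }
    upper.flatMap(refs).groupBy(identity).map { case (k, v) => k -> v.size }
  }

  // Same shape as the check in the diff: an alias is a problem only if it is
  // referenced more than once AND its expression is larger than maxSize.
  def hasOversizedRepeatedAliases(
      upper: Seq[Expr], lower: Map[String, Expr], maxSize: Int): Boolean = {
    val counts = referenceCounts(upper)
    lower.exists { case (name, expr) =>
      counts.getOrElse(name, 0) > 1 && expr.treeSize > maxSize
    }
  }
}
```

With the alias `a = x + 1` (three nodes) and an upper expression `a + a`, substitution turns a three-node tree into a seven-node one, and each further level of the same pattern roughly doubles it again. An alias referenced only once is substituted without any duplication, which is why the sketch, like the diff, skips those.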