cloud-fan commented on a change in pull request #23556: [SPARK-26626][SQL] Maximum size for repeatedly substituted aliases in SQL expressions
URL: https://github.com/apache/spark/pull/23556#discussion_r248928260
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
 ##########
 @@ -686,6 +687,28 @@ object CollapseProject extends Rule[LogicalPlan] {
     }.exists(!_.deterministic))
   }
 
+  private def hasOversizedRepeatedAliases(
+      upper: Seq[NamedExpression], lower: Seq[NamedExpression]): Boolean = {
+    val aliases = collectAliases(lower)
+
+    // Count how many times each alias is used in the upper Project.
+    // If an alias is only used once, we can safely substitute it without increasing the overall
+    // tree size
+    val referenceCounts = AttributeMap(
+      upper
+        .flatMap(_.collect { case a: Attribute => a })
+        .groupBy(identity)
+        .mapValues(_.size).toSeq
+    )
+
+    // Check for any aliases that are used more than once, and are larger than the configured
+    // maximum size
+    aliases.exists({ case (attribute, expression) =>
+      referenceCounts.getOrElse(attribute, 0) > 1 &&
+        expression.treeSize > SQLConf.get.maxRepeatedAliasSize
 
 Review comment:
   I'm not sure about using `treeSize` as the cost of an expression. A UDF can be very expensive even if its `treeSize` is 1.
   
   How about we simplify it with a blacklist? For example, UDFs are expensive, so we shouldn't collapse projects if doing so would repeat a UDF.
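
   A minimal sketch of how such a blacklist check might look, written against a toy expression model rather than Catalyst. Everything here is an assumption made for illustration: `Expr`, `Attr`, `Lit`, `Add`, `Udf`, `isExpensive`, and `hasRepeatedExpensiveAliases` are hypothetical names, not Spark APIs, and the lower Project's aliases are modeled as a plain `Map` from alias name to expression.

```scala
// Toy expression model for illustration only; these are NOT Spark's Catalyst classes.
object CollapseSketch {
  sealed trait Expr {
    def children: Seq[Expr]
    // Node count of the tree, analogous in spirit to `treeSize` in the diff.
    def treeSize: Int = 1 + children.map(_.treeSize).sum
    // Pre-order collection of every node the partial function matches.
    def collect[A](pf: PartialFunction[Expr, A]): Seq[A] =
      pf.lift(this).toList ++ children.flatMap(_.collect(pf))
  }
  final case class Attr(name: String) extends Expr { def children: Seq[Expr] = Nil }
  final case class Lit(value: Int) extends Expr { def children: Seq[Expr] = Nil }
  final case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
  }
  final case class Udf(name: String, args: Seq[Expr]) extends Expr {
    def children: Seq[Expr] = args
  }

  // Hypothetical blacklist: an expression is expensive if it contains any UDF.
  def isExpensive(e: Expr): Boolean = e.collect { case u: Udf => u }.nonEmpty

  // How many times each attribute name is referenced in the upper projection.
  def referenceCounts(upper: Seq[Expr]): Map[String, Int] =
    upper
      .flatMap(_.collect { case Attr(n) => n })
      .groupBy(identity)
      .map { case (name, refs) => name -> refs.size }

  // True if some alias is referenced more than once and is blacklisted,
  // regardless of how small its tree is.
  def hasRepeatedExpensiveAliases(upper: Seq[Expr], lower: Map[String, Expr]): Boolean = {
    val counts = referenceCounts(upper)
    lower.exists { case (name, expr) =>
      counts.getOrElse(name, 0) > 1 && isExpensive(expr)
    }
  }
}
```

   In this model a zero-argument `Udf` has a `treeSize` of 1, so a size threshold would never flag it, while the blacklist check flags it as soon as the upper projection references its alias twice.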
