Re: [PR] [SPARK-35564][SQL] Improve subexpression elimination [spark]

via GitHub Tue, 07 Nov 2023 05:28:08 -0800


cloud-fan commented on code in PR #41677:
URL: https://github.com/apache/spark/pull/41677#discussion_r1384915433



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##########
@@ -30,211 +30,382 @@ import org.apache.spark.util.Utils
  * This class is used to compute equality of (sub)expression trees. Expressions can be added
  * to this class and they subsequently query for expression equality. Expression trees are
  * considered equal if for the same input(s), the same result is produced.
+ *
+ * Please note that `EquivalentExpressions` is mainly used in subexpression elimination where common
+ * non-leaf expression subtrees are calculated, but there there is one special use case in
+ * `PhysicalAggregation` where `EquivalentExpressions` is used as a mutable set of non-deterministic
+ * expressions. For that special use case we have the `allowLeafExpressions` config.
  */
 class EquivalentExpressions(
-    skipForShortcutEnable: Boolean = SQLConf.get.subexpressionEliminationSkipForShotcutExpr) {
+    skipForShortcutEnable: Boolean = SQLConf.get.subexpressionEliminationSkipForShotcutExpr,
+    minConditionalCount: Option[Double] =
+      Some(SQLConf.get.subexpressionEliminationMinExpectedConditionalEvaluationCount)
+        .filter(_ >= 0d),
+    allowLeafExpressions: Boolean = false) {
+
+  // The subexpressions are stored by height to speed up certain calculations.
+  private val maps = mutable.ArrayBuffer[mutable.Map[ExpressionEquals, ExpressionStats]]()
 
-  // For each expression, the set of equivalent expressions.
-  private val equivalenceMap = mutable.HashMap.empty[ExpressionEquals, ExpressionStats]
+  // `EquivalentExpressions` has 2 states internally, it can be either inflated or not.
+  // The inflated state means that all added expressions have been traversed recursively and their
+  // subexpressions are also added to `maps`. The idea behind these 2 states is that when an
+  // expression tree is added we don't need to traverse/record its subexpressions immediately.
+  // The typical use case of this data structure is that multiple expression trees are added and
+  // then we want to see the common subexpressions. It might be the case that the same expression
+  // trees or partly overlapping expressions trees are added multiple times. With this approach we
+  // just need to record how many times an expression tree is explicitly added when later when
+  // `getExprState()` or `getCommonSubexpressions()` is called we inflate the data structure (do the
+  // recursive traversal and record the subexpressions in `inflate()`) if needed.
+  private var inflated: Boolean = true
 
   /**
-   * Adds each expression to this data structure, grouping them with existing equivalent
-   * expressions. Non-recursive.
-   * Returns true if there was already a matching expression.
+   * Adds each expression to this data structure and returns true if there was already a matching
+   * expression.
    */
   def addExpr(expr: Expression): Boolean = {

Review Comment:
   It looks like the only difference between this and `addExprTree` is that
   `addExprTree` allows non-deterministic expressions. Shall we give these two
   methods better names?
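
   For readers following the diff, the two-state ("inflated or not") idea described in the code comment can be illustrated with a minimal sketch. This is not the PR's implementation: `Expr`, `Leaf`, `Add`, `LazyEquivalence`, `pending`, `counts`, and `commonSubexpressions` are hypothetical names over a toy expression tree, and the sketch ignores determinism, conditional evaluation counts, and the height-indexed `maps`. It only shows that adding a tree records a count, and the recursive traversal is deferred until a query needs it.

```scala
import scala.collection.mutable

// Toy expression tree; a hypothetical stand-in for Catalyst's `Expression`.
sealed trait Expr { def children: Seq[Expr] }
case class Leaf(name: String) extends Expr { val children: Seq[Expr] = Nil }
case class Add(left: Expr, right: Expr) extends Expr {
  val children: Seq[Expr] = Seq(left, right)
}

class LazyEquivalence {
  // How many times each root was explicitly added but not yet traversed.
  private val pending = mutable.Map.empty[Expr, Int].withDefaultValue(0)
  // Occurrence counts of every (sub)expression seen during traversal.
  private val counts = mutable.Map.empty[Expr, Int].withDefaultValue(0)
  // True when `counts` already reflects everything that was added.
  private var inflated = true

  // Cheap: only records the root; no recursion happens here.
  def addExprTree(e: Expr): Unit = {
    pending(e) += 1
    inflated = false
  }

  // Deferred recursive traversal; each distinct root is walked once,
  // weighted by the number of times it was added.
  private def inflate(): Unit = if (!inflated) {
    def visit(e: Expr, n: Int): Unit = {
      counts(e) += n
      e.children.foreach(visit(_, n))
    }
    pending.foreach { case (e, n) => visit(e, n) }
    pending.clear()
    inflated = true
  }

  // Non-leaf expressions that occur more than once across all added trees.
  def commonSubexpressions: Set[Expr] = {
    inflate()
    counts.collect { case (e, n) if n > 1 && e.children.nonEmpty => e }.toSet
  }
}
```

   In this sketch, adding the same tree many times costs one map update per call, and the tree is walked only once when `commonSubexpressions` is first queried.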



