cloud-fan commented on code in PR #41677:
URL: https://github.com/apache/spark/pull/41677#discussion_r1384915433
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##########
@@ -30,211 +30,382 @@ import org.apache.spark.util.Utils
* This class is used to compute equality of (sub)expression trees.
Expressions can be added
* to this class and they subsequently query for expression equality.
Expression trees are
* considered equal if for the same input(s), the same result is produced.
+ *
+ * Please note that `EquivalentExpressions` is mainly used in subexpression
elimination where common
+ * non-leaf expression subtrees are calculated, but there there is one special
use case in
+ * `PhysicalAggregation` where `EquivalentExpressions` is used as a mutable
set of non-deterministic
+ * expressions. For that special use case we have the `allowLeafExpressions`
config.
*/
class EquivalentExpressions(
- skipForShortcutEnable: Boolean =
SQLConf.get.subexpressionEliminationSkipForShotcutExpr) {
+ skipForShortcutEnable: Boolean =
SQLConf.get.subexpressionEliminationSkipForShotcutExpr,
+ minConditionalCount: Option[Double] =
+
Some(SQLConf.get.subexpressionEliminationMinExpectedConditionalEvaluationCount)
+ .filter(_ >= 0d),
+ allowLeafExpressions: Boolean = false) {
+
+ // The subexpressions are stored by height to speed up certain calculations.
+ private val maps = mutable.ArrayBuffer[mutable.Map[ExpressionEquals,
ExpressionStats]]()
- // For each expression, the set of equivalent expressions.
- private val equivalenceMap = mutable.HashMap.empty[ExpressionEquals,
ExpressionStats]
+ // `EquivalentExpressions` has 2 states internally, it can be either
inflated or not.
+ // The inflated state means that all added expressions have been traversed
recursively and their
+ // subexpressions are also added to `maps`. The idea behind these 2 states
is that when an
+ // expression tree is added we don't need to traverse/record its
subexpressions immediately.
+ // The typical use case of this data structure is that multiple expression
trees are added and
+ // then we want to see the common subexpressions. It might be the case that
the same expression
+ // trees or partly overlapping expressions trees are added multiple times.
With this approach we
+ // just need to record how many times an expression tree is explicitly added
when later when
+ // `getExprState()` or `getCommonSubexpressions()` is called we inflate the
data structure (do the
+ // recursive traversal and record the subexpressions in `inflate()`) if
needed.
+ private var inflated: Boolean = true
/**
- * Adds each expression to this data structure, grouping them with existing
equivalent
- * expressions. Non-recursive.
- * Returns true if there was already a matching expression.
+ * Adds each expression to this data structure and returns true if there was
already a matching
+ * expression.
*/
def addExpr(expr: Expression): Boolean = {
Review Comment:
it looks like the only difference between this and `addExprTree` is,
`addExprTree` allows non-deterministic expression. Shall we name these two
methods better?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]