maropu commented on a change in pull request #32559:
URL: https://github.com/apache/spark/pull/32559#discussion_r633282516
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
##########
@@ -82,21 +82,31 @@ class EquivalentExpressions {
/**
* Adds only expressions which are common in each of given expressions, in a recursive way.
* For example, given two expressions `(a + (b + (c + 1)))` and `(d + (e + (c + 1)))`,
- * the common expression `(c + 1)` will be added into `equivalenceMap`.
+ * the common expression `(c + 1)` will be added into `equivalenceMap`. Note that if an
+ * expression and its child expressions are all commonly occurred in each of given expressions,
+ * we filter out the child expressions. For example, if `((a + b) + c)` and `(a + b)` are
+ * common expressions, we only add `((a + b) + c)`.
Review comment:
> If the redundant children expressions are counted as common
expressions too, they will be redundantly evaluated and miss the subexpression
elimination opportunity.
Could you leave a comment here explaining why we need to filter out these
exprs?
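For readers following the thread, here is a minimal sketch of the intersection step that produces the candidate set. It uses a toy tree (`Tree`, `Sym`, `Plus`, and `CommonExprSketch` are hypothetical names, not Catalyst classes) and ignores the extra skip conditions the real `addExprTree` applies.

```scala
// Toy stand-in for Catalyst expressions; these are NOT Spark's real classes.
sealed trait Tree
case class Sym(name: String) extends Tree
case class Plus(left: Tree, right: Tree) extends Tree

object CommonExprSketch {
  // All non-leaf subtrees of `t`, including `t` itself when it is a Plus.
  def subtrees(t: Tree): Set[Tree] = t match {
    case Sym(_) => Set.empty
    case p @ Plus(l, r) => (subtrees(l) ++ subtrees(r)) + p
  }

  // Subtrees that occur in every given expression: start from the head's
  // subtrees and intersect with the subtrees of each remaining expression.
  def common(exprs: Seq[Tree]): Set[Tree] =
    exprs.tail.foldLeft(subtrees(exprs.head)) { (acc, e) =>
      acc.intersect(subtrees(e))
    }
}
```

With the doc comment's first example, `common` yields only `(c + 1)`. If two expressions both contain `((a + b) + c)`, the intersection holds both `((a + b) + c)` and its child `(a + b)`, which is the case the new filter targets.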
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
##########
@@ -82,21 +82,31 @@ class EquivalentExpressions {
/**
* Adds only expressions which are common in each of given expressions, in a recursive way.
* For example, given two expressions `(a + (b + (c + 1)))` and `(d + (e + (c + 1)))`,
- * the common expression `(c + 1)` will be added into `equivalenceMap`.
+ * the common expression `(c + 1)` will be added into `equivalenceMap`. Note that if an
+ * expression and its child expressions are all commonly occurred in each of given expressions,
+ * we filter out the child expressions. For example, if `((a + b) + c)` and `(a + b)` are
+ * common expressions, we only add `((a + b) + c)`.
*/
private def addCommonExprs(
exprs: Seq[Expression],
addFunc: Expression => Boolean = addExpr): Unit = {
val exprSetForAll = mutable.Set[Expr]()
addExprTree(exprs.head, addExprToSet(_, exprSetForAll))
- val commonExprSet = exprs.tail.foldLeft(exprSetForAll) { (exprSet, expr) =>
+ val candidateExprs = exprs.tail.foldLeft(exprSetForAll) { (exprSet, expr) =>
val otherExprSet = mutable.Set[Expr]()
addExprTree(expr, addExprToSet(_, otherExprSet))
exprSet.intersect(otherExprSet)
}
- commonExprSet.foreach(expr => addFunc(expr.e))
+ // Not all expressions in the set should be added. We should filter out the subexprs.
+ val commonExprSet = candidateExprs.filter { candidateExpr =>
+ candidateExprs.forall { expr =>
+ expr == candidateExpr || expr.e.find(_.semanticEquals(candidateExpr.e)).isEmpty
+ }
Review comment:
Isn't this loop expensive? It seems the time complexity is
O((total number of expr nodes in `candidateExprs`) x (`candidateExprs.size`)^2)?
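To make the cost question concrete, here is a rough sketch of the same nested filter over a toy tree, with a counter for the per-node equality checks. `Node`, `Leaf`, `Add`, and `FilterSketch` are hypothetical names, and plain case-class equality stands in for `semanticEquals`; this is an illustration, not the PR's code.

```scala
// Toy stand-in for Catalyst's Expression; NOT Spark's real class.
sealed trait Node {
  def children: Seq[Node]
  // Pre-order search that stops at the first match.
  def find(p: Node => Boolean): Option[Node] =
    if (p(this)) Some(this)
    else children.foldLeft(Option.empty[Node])((acc, c) => acc.orElse(c.find(p)))
  // Number of nodes in this tree.
  def size: Int = 1 + children.map(_.size).sum
}
case class Leaf(name: String) extends Node { val children: Seq[Node] = Nil }
case class Add(left: Node, right: Node) extends Node {
  val children: Seq[Node] = Seq(left, right)
}

object FilterSketch {
  // Keeps only candidates that are not nested inside another candidate.
  // Also returns how many per-node equality checks the search performed.
  def filterSubExprs(candidates: Set[Node]): (Set[Node], Long) = {
    var checks = 0L
    val kept = candidates.filter { candidate =>
      candidates.forall { expr =>
        expr == candidate ||
          expr.find(n => { checks += 1; n == candidate }).isEmpty
      }
    }
    (kept, checks)
  }
}
```

In this toy version the counter is bounded by `candidates.size` times the total node count of all candidates, and each check may itself walk a subtree, so the overall cost grows quickly with the number and depth of candidates.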
##########
File path:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala
##########
@@ -82,21 +82,31 @@ class EquivalentExpressions {
/**
* Adds only expressions which are common in each of given expressions, in a recursive way.
* For example, given two expressions `(a + (b + (c + 1)))` and `(d + (e + (c + 1)))`,
- * the common expression `(c + 1)` will be added into `equivalenceMap`.
+ * the common expression `(c + 1)` will be added into `equivalenceMap`. Note that if an
+ * expression and its child expressions are all commonly occurred in each of given expressions,
+ * we filter out the child expressions. For example, if `((a + b) + c)` and `(a + b)` are
+ * common expressions, we only add `((a + b) + c)`.
Review comment:
Just a question: even if we filter out the redundant expr here (e.g.,
`(a + b)` in this case), can the suboptimal case that this PR points out still
happen if `(a + b)` is added as a common expr in another part? I was thinking
of a query like this:
`Seq((1, 1, 1)).toDF("a", "b", "c").select(when($"a" + $"b" + $"c" > 0, $"a" + $"b" + $"c").when($"a" + $"b" + $"c" <= 0, $"a" + $"b"))`.