Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]
Kimahriman commented on PR #32987:
URL: https://github.com/apache/spark/pull/32987#issuecomment-4006359564
Wrote a benchmark for this
```scala
object ConditionalSubexpressionBenchmark extends SqlBasedBenchmark {
private val N = 5 * 1000 * 1000
private def runCaseWhenRegexBenchmark(groupName: String, df: Dataset[_]):
Unit = {
val benchmark = new Benchmark(groupName, N, output = output)
val extracted = regexp_extract(col("s"), "(\\d+)", 1)
val conditionalRegexExpr = when(extracted =!= lit(""), extracted)
df.persist(StorageLevel.MEMORY_ONLY).count()
benchmark.addCase("conditionals disabled", 10) { _ =>
withSQLConf(SQLConf.SUBEXPRESSION_ELIMINATION_CONDITIONALS_ENABLED.key
-> "false") {
df.select(conditionalRegexExpr.alias("value")).noop()
}
}
benchmark.addCase("conditionals enabled", 10) { _ =>
withSQLConf(SQLConf.SUBEXPRESSION_ELIMINATION_CONDITIONALS_ENABLED.key
-> "true") {
df.select(conditionalRegexExpr.alias("value")).noop()
}
}
benchmark.run()
df.unpersist()
}
override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
val alwaysMatchDf = spark.range(N).select(
concat(lit("spark-"), (col("id") % 1000).cast("string"),
lit("-sql")).alias("s"))
runCaseWhenRegexBenchmark(
"Conditional regex subexpression elimination (CASE always matches)",
alwaysMatchDf)
val neverMatchDf =
spark.range(N).select(lit("spark-no-digits-sql").alias("s"))
runCaseWhenRegexBenchmark(
"Conditional regex subexpression elimination (CASE never matches)",
neverMatchDf)
}
}
```
```
[info] Running benchmark: Conditional regex subexpression elimination (CASE
always matches)
[info] Running case: conditionals disabled
[info] Stopped after 10 iterations, 26754 ms
[info] Running case: conditionals enabled
[info] Stopped after 10 iterations, 14592 ms
[info] OpenJDK 64-Bit Server VM 21.0.10 on Mac OS X 26.3
[info] Apple M1 Max
[info] Conditional regex subexpression elimination (CASE always matches):
Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative
[info]
-
[info] conditionals disabled
2642 2675 50 1.9 528.4 1.0X
[info] conditionals enabled
1447 1459 9 3.5 289.4 1.8X
[info] Running benchmark: Conditional regex subexpression elimination (CASE
never matches)
[info] Running case: conditionals disabled
[info] Stopped after 10 iterations, 15458 ms
[info] Running case: conditionals enabled
[info] Stopped after 10 iterations, 15703 ms
[info] OpenJDK 64-Bit Server VM 21.0.10 on Mac OS X 26.3
[info] Apple M1 Max
[info] Conditional regex subexpression elimination (CASE never matches):
Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative
[info]
[info] conditionals disabled
1538 1546 6 3.3 307.6 1.0X
[info] conditionals enabled
1562 1570 5 3.2 312.4 1.0X
```
Demonstrates the clear speedup and near negligible worst-case scenario where
the second case is never evaluated. In some cases the conditionals enabled case
was even faster in the worst case so it's basically a tossup. It seems like the
only hesitation was that worst-case where the expression was moved to a
function instead of inlined, resulting in additional function calls. But
considering there are a lot of benefits to splitting code into functions for
more JIT compilable functions, this seems like a non-issue. And even thinks
like simple math expressions still call functions instead of directly inlining
the operation.
While this example could be handled by the special `With` expression that
exists now for `nullif` and `between`, this has the benefit of working for all
types of expressions, and would lead into working with the subexpression
elimination inside higher order functions I have working in
https://github.com/apache/spark/pull/51272
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]
Kimahriman commented on code in PR #32987:
URL: https://github.com/apache/spark/pull/32987#discussion_r1949018995
##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -255,4 +304,26 @@ case class ExpressionEquals(e: Expression) {
* Instead of appending to a mutable list/buffer of Expressions, just update
the "flattened"
* useCount in this wrapper in-place.
*/
-case class ExpressionStats(expr: Expression)(var useCount: Int)
+case class ExpressionStats(expr: Expression)(
+var useCount: Int = 1,
+var conditionalUseCount: Int = 0) {
+ def getUseCount(): Int = if (useCount > 0) {
+useCount + conditionalUseCount
+ } else {
+0
+ }
+}
+
+/**
+ * A wrapper for the different types of children of expressions.
`alwaysChildren` are child
+ * expressions that will always be evaluated and should be considered for
subexpressions.
+ * `commonChildren` are children such that if there are any common expressions
among them, those
+ * should be considered for subexpressions. `conditionalChildren` are children
that are
+ * conditionally evaluated, such as in If, CaseWhen, or Coalesce expressions,
and should only
+ * be considered for subexpressions if they are evaluated non-conditionally
elsewhere.
+ */
+case class RecurseChildren(
+alwaysChildren: Seq[Expression],
+commonChildren: Seq[Seq[Expression]] = Nil,
Review Comment:
Updated the docs a little bit to clarify. Currently it's only `If` and
`CaseWhen` expressions that `commonChildren` applies too, should I put one of
those as an example in the doc?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]
Kimahriman commented on code in PR #32987: URL: https://github.com/apache/spark/pull/32987#discussion_r1948999862 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala: ## @@ -99,34 +111,37 @@ class EquivalentExpressions( * only the common nodes. * Those common nodes are then removed from the local map and added to the final map of * expressions. + * + * Conditional expressions are not considered because we are simply looking for expressions + * evaluated once in each parent expression. */ - private def updateCommonExprs( - exprs: Seq[Expression], - map: mutable.HashMap[ExpressionEquals, ExpressionStats], Review Comment: Yep good call, haven't kept up with with some of the docs -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]
Kimahriman commented on code in PR #32987:
URL: https://github.com/apache/spark/pull/32987#discussion_r1948999472
##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -79,7 +86,12 @@ class EquivalentExpressions(
}
case _ =>
if (useCount > 0) {
-map.put(wrapper, ExpressionStats(expr)(useCount))
+val stats = if (conditional) {
+ ExpressionStats(expr)(0, useCount)
Review Comment:
Updated
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]
cloud-fan commented on code in PR #32987:
URL: https://github.com/apache/spark/pull/32987#discussion_r1948345527
##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -255,4 +304,26 @@ case class ExpressionEquals(e: Expression) {
* Instead of appending to a mutable list/buffer of Expressions, just update
the "flattened"
* useCount in this wrapper in-place.
*/
-case class ExpressionStats(expr: Expression)(var useCount: Int)
+case class ExpressionStats(expr: Expression)(
+var useCount: Int = 1,
+var conditionalUseCount: Int = 0) {
+ def getUseCount(): Int = if (useCount > 0) {
+useCount + conditionalUseCount
+ } else {
+0
+ }
+}
+
+/**
+ * A wrapper for the different types of children of expressions.
`alwaysChildren` are child
+ * expressions that will always be evaluated and should be considered for
subexpressions.
+ * `commonChildren` are children such that if there are any common expressions
among them, those
+ * should be considered for subexpressions. `conditionalChildren` are children
that are
+ * conditionally evaluated, such as in If, CaseWhen, or Coalesce expressions,
and should only
+ * be considered for subexpressions if they are evaluated non-conditionally
elsewhere.
+ */
+case class RecurseChildren(
+alwaysChildren: Seq[Expression],
+commonChildren: Seq[Seq[Expression]] = Nil,
Review Comment:
do we have an example this `commonChildren`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]
cloud-fan commented on code in PR #32987: URL: https://github.com/apache/spark/pull/32987#discussion_r1948344607 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala: ## @@ -99,34 +111,37 @@ class EquivalentExpressions( * only the common nodes. * Those common nodes are then removed from the local map and added to the final map of * expressions. + * + * Conditional expressions are not considered because we are simply looking for expressions + * evaluated once in each parent expression. */ - private def updateCommonExprs( - exprs: Seq[Expression], - map: mutable.HashMap[ExpressionEquals, ExpressionStats], Review Comment: shall we update the doc of this method? no equivalenceMap in this method now. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]
cloud-fan commented on code in PR #32987:
URL: https://github.com/apache/spark/pull/32987#discussion_r1948343729
##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -79,7 +86,12 @@ class EquivalentExpressions(
}
case _ =>
if (useCount > 0) {
-map.put(wrapper, ExpressionStats(expr)(useCount))
+val stats = if (conditional) {
+ ExpressionStats(expr)(0, useCount)
Review Comment:
```suggestion
ExpressionStats(expr)(useCount = 0, conditionalUseCount =
useCount)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]
cloud-fan commented on code in PR #32987:
URL: https://github.com/apache/spark/pull/32987#discussion_r1948343729
##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -79,7 +86,12 @@ class EquivalentExpressions(
}
case _ =>
if (useCount > 0) {
-map.put(wrapper, ExpressionStats(expr)(useCount))
+val stats = if (conditional) {
+ ExpressionStats(expr)(0, useCount)
Review Comment:
```suggestion
ExpressionStats(expr)(useCount = 0, useCount)
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
