Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]

2026-03-05 Thread via GitHub


Kimahriman commented on PR #32987:
URL: https://github.com/apache/spark/pull/32987#issuecomment-4006359564

   Wrote a benchmark for this
   
   ```scala
   object ConditionalSubexpressionBenchmark extends SqlBasedBenchmark {
 private val N = 5 * 1000 * 1000
   
 private def runCaseWhenRegexBenchmark(groupName: String, df: Dataset[_]): 
Unit = {
   val benchmark = new Benchmark(groupName, N, output = output)
   
   val extracted = regexp_extract(col("s"), "(\\d+)", 1)
   val conditionalRegexExpr = when(extracted =!= lit(""), extracted)
   
   df.persist(StorageLevel.MEMORY_ONLY).count()
   
   benchmark.addCase("conditionals disabled", 10) { _ =>
 withSQLConf(SQLConf.SUBEXPRESSION_ELIMINATION_CONDITIONALS_ENABLED.key 
-> "false") {
   df.select(conditionalRegexExpr.alias("value")).noop()
 }
   }
   
   benchmark.addCase("conditionals enabled", 10) { _ =>
 withSQLConf(SQLConf.SUBEXPRESSION_ELIMINATION_CONDITIONALS_ENABLED.key 
-> "true") {
   df.select(conditionalRegexExpr.alias("value")).noop()
 }
   }
   
   benchmark.run()
   
   df.unpersist()
 }
   
 override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
   val alwaysMatchDf = spark.range(N).select(
 concat(lit("spark-"), (col("id") % 1000).cast("string"), 
lit("-sql")).alias("s"))
   runCaseWhenRegexBenchmark(
 "Conditional regex subexpression elimination (CASE always matches)",
 alwaysMatchDf)
   
   val neverMatchDf = 
spark.range(N).select(lit("spark-no-digits-sql").alias("s"))
   runCaseWhenRegexBenchmark(
 "Conditional regex subexpression elimination (CASE never matches)",
 neverMatchDf)
 }
   }
   ```
   
   ```
   [info] Running benchmark: Conditional regex subexpression elimination (CASE 
always matches)
   [info]   Running case: conditionals disabled
   [info]   Stopped after 10 iterations, 26754 ms
   [info]   Running case: conditionals enabled
   [info]   Stopped after 10 iterations, 14592 ms
   [info] OpenJDK 64-Bit Server VM 21.0.10 on Mac OS X 26.3
   [info] Apple M1 Max
   [info] Conditional regex subexpression elimination (CASE always matches):  
Best Time(ms)   Avg Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
   [info] 
-
   [info] conditionals disabled 
  2642   2675  50  1.9 528.4   1.0X
   [info] conditionals enabled  
  1447   1459   9  3.5 289.4   1.8X
   [info] Running benchmark: Conditional regex subexpression elimination (CASE 
never matches)
   [info]   Running case: conditionals disabled
   [info]   Stopped after 10 iterations, 15458 ms
   [info]   Running case: conditionals enabled
   [info]   Stopped after 10 iterations, 15703 ms
   [info] OpenJDK 64-Bit Server VM 21.0.10 on Mac OS X 26.3
   [info] Apple M1 Max
   [info] Conditional regex subexpression elimination (CASE never matches):  
Best Time(ms)   Avg Time(ms)   Stdev(ms)Rate(M/s)   Per Row(ns)   Relative
   [info] 

   [info] conditionals disabled 
 1538   1546   6  3.3 307.6   1.0X
   [info] conditionals enabled  
 1562   1570   5  3.2 312.4   1.0X
   ```
   
   Demonstrates the clear speedup and near negligible worst-case scenario where 
the second case is never evaluated. In some cases the conditionals enabled case 
was even faster in the worst case so it's basically a tossup. It seems like the 
only hesitation was that worst-case where the expression was moved to a 
function instead of inlined, resulting in additional function calls. But 
considering there are a lot of benefits to splitting code into functions for 
more JIT compilable functions, this seems like a non-issue. And even thinks 
like simple math expressions still call functions instead of directly inlining 
the operation.
   
   While this example could be handled by the special `With` expression that 
exists now for `nullif` and `between`, this has the benefit of working for all 
types of expressions, and would lead into working with the subexpression 
elimination inside higher order functions I have working in 
https://github.com/apache/spark/pull/51272


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]

2025-02-10 Thread via GitHub


Kimahriman commented on code in PR #32987:
URL: https://github.com/apache/spark/pull/32987#discussion_r1949018995


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -255,4 +304,26 @@ case class ExpressionEquals(e: Expression) {
  * Instead of appending to a mutable list/buffer of Expressions, just update 
the "flattened"
  * useCount in this wrapper in-place.
  */
-case class ExpressionStats(expr: Expression)(var useCount: Int)
+case class ExpressionStats(expr: Expression)(
+var useCount: Int = 1,
+var conditionalUseCount: Int = 0) {
+  def getUseCount(): Int = if (useCount > 0) {
+useCount + conditionalUseCount
+  } else {
+0
+  }
+}
+
+/**
+ * A wrapper for the different types of children of expressions. 
`alwaysChildren` are child
+ * expressions that will always be evaluated and should be considered for 
subexpressions.
+ * `commonChildren` are children such that if there are any common expressions 
among them, those
+ * should be considered for subexpressions. `conditionalChildren` are children 
that are
+ * conditionally evaluated, such as in If, CaseWhen, or Coalesce expressions, 
and should only
+ * be considered for subexpressions if they are evaluated non-conditionally 
elsewhere.
+ */
+case class RecurseChildren(
+alwaysChildren: Seq[Expression],
+commonChildren: Seq[Seq[Expression]] = Nil,

Review Comment:
   Updated the docs a little bit to clarify. Currently it's only `If` and 
`CaseWhen` expressions that `commonChildren` applies too, should I put one of 
those as an example in the doc?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]

2025-02-10 Thread via GitHub


Kimahriman commented on code in PR #32987:
URL: https://github.com/apache/spark/pull/32987#discussion_r1948999862


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -99,34 +111,37 @@ class EquivalentExpressions(
* only the common nodes.
* Those common nodes are then removed from the local map and added to the 
final map of
* expressions.
+   *
+   * Conditional expressions are not considered because we are simply looking 
for expressions
+   * evaluated once in each parent expression.
*/
-  private def updateCommonExprs(
-  exprs: Seq[Expression],
-  map: mutable.HashMap[ExpressionEquals, ExpressionStats],

Review Comment:
   Yep good call, haven't kept up with with some of the docs



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]

2025-02-10 Thread via GitHub


Kimahriman commented on code in PR #32987:
URL: https://github.com/apache/spark/pull/32987#discussion_r1948999472


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -79,7 +86,12 @@ class EquivalentExpressions(
   }
 case _ =>
   if (useCount > 0) {
-map.put(wrapper, ExpressionStats(expr)(useCount))
+val stats = if (conditional) {
+  ExpressionStats(expr)(0, useCount)

Review Comment:
   Updated



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #32987:
URL: https://github.com/apache/spark/pull/32987#discussion_r1948345527


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -255,4 +304,26 @@ case class ExpressionEquals(e: Expression) {
  * Instead of appending to a mutable list/buffer of Expressions, just update 
the "flattened"
  * useCount in this wrapper in-place.
  */
-case class ExpressionStats(expr: Expression)(var useCount: Int)
+case class ExpressionStats(expr: Expression)(
+var useCount: Int = 1,
+var conditionalUseCount: Int = 0) {
+  def getUseCount(): Int = if (useCount > 0) {
+useCount + conditionalUseCount
+  } else {
+0
+  }
+}
+
+/**
+ * A wrapper for the different types of children of expressions. 
`alwaysChildren` are child
+ * expressions that will always be evaluated and should be considered for 
subexpressions.
+ * `commonChildren` are children such that if there are any common expressions 
among them, those
+ * should be considered for subexpressions. `conditionalChildren` are children 
that are
+ * conditionally evaluated, such as in If, CaseWhen, or Coalesce expressions, 
and should only
+ * be considered for subexpressions if they are evaluated non-conditionally 
elsewhere.
+ */
+case class RecurseChildren(
+alwaysChildren: Seq[Expression],
+commonChildren: Seq[Seq[Expression]] = Nil,

Review Comment:
   do we have an example this `commonChildren`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #32987:
URL: https://github.com/apache/spark/pull/32987#discussion_r1948344607


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -99,34 +111,37 @@ class EquivalentExpressions(
* only the common nodes.
* Those common nodes are then removed from the local map and added to the 
final map of
* expressions.
+   *
+   * Conditional expressions are not considered because we are simply looking 
for expressions
+   * evaluated once in each parent expression.
*/
-  private def updateCommonExprs(
-  exprs: Seq[Expression],
-  map: mutable.HashMap[ExpressionEquals, ExpressionStats],

Review Comment:
   shall we update the doc of this method? no equivalenceMap in this method now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #32987:
URL: https://github.com/apache/spark/pull/32987#discussion_r1948343729


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -79,7 +86,12 @@ class EquivalentExpressions(
   }
 case _ =>
   if (useCount > 0) {
-map.put(wrapper, ExpressionStats(expr)(useCount))
+val stats = if (conditional) {
+  ExpressionStats(expr)(0, useCount)

Review Comment:
   ```suggestion
 ExpressionStats(expr)(useCount = 0, conditionalUseCount = 
useCount)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] [SPARK-35564][SQL] Support subexpression elimination for conditionally evaluated expressions [spark]

2025-02-09 Thread via GitHub


cloud-fan commented on code in PR #32987:
URL: https://github.com/apache/spark/pull/32987#discussion_r1948343729


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/EquivalentExpressions.scala:
##
@@ -79,7 +86,12 @@ class EquivalentExpressions(
   }
 case _ =>
   if (useCount > 0) {
-map.put(wrapper, ExpressionStats(expr)(useCount))
+val stats = if (conditional) {
+  ExpressionStats(expr)(0, useCount)

Review Comment:
   ```suggestion
 ExpressionStats(expr)(useCount = 0, useCount)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]