Re: [PR] [SPARK-56032][SQL][FOLLOWUP] Skip FilterExec CSE codegen when the only common subexpression is a leaf [spark]

via GitHub Fri, 19 Jun 2026 01:21:24 -0700


LuciferYang commented on code in PR #56604:
URL: https://github.com/apache/spark/pull/56604#discussion_r3441165030



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala:
##########
@@ -1225,6 +1225,48 @@ class WholeStageCodegenSuite extends SharedSparkSession
         "CSE-disabled codegen (i.e. fall back to the lazy, short-circuiting 
non-CSE path)")
   }
 
+  test("SPARK-56032: FilterExec skips CSE codegen when the only common 
subexpression is a leaf") {

Review Comment:
   Seems the motivation story and the new test case both rest on a premise that 
does not hold: 
   *"`c BETWEEN lo AND hi` lowers to `c >= lo AND c <= hi`, so any `BETWEEN` 
makes that column a common subexpression."*
   
   Two reasons it does not:
   
   1. `EquivalentExpressions.updateExprTree` skips `LeafExpression` outright 
(`val skip = useCount == 0 || expr.isInstanceOf[LeafExpression]`). 
`BoundReference` and `Attribute` are both `LeafExpression`, so a bare column 
never enters the equivalence map regardless of how many times it appears.
    
   2. `FilterExec.splitConjunctivePredicates` splits `q BETWEEN 2 AND 6` into 
`q >= 2` and `q <= 6` as two independent conjuncts, fed to `addExprTree` 
separately. Across separate `addExprTree` calls a leaf cannot be deduplicated.
   
   I verified this by dumping 
`otherPredsEquivalentExpressions.getCommonSubexpressions` on real plans, both 
before and after this PR:
   
   | Filter | `commonSubexpressions` (pre-PR == post-PR) |
   | --- | --- |
   | `q IS NOT NULL AND q BETWEEN 2 AND 6 AND (p1 BETWEEN 8 AND 18 OR p2 
BETWEEN 5 AND 9)` (this PR's test) | empty |
   | The actual q28 SQL (Int + Decimal columns) | empty |
   | `s.x > 5 AND s.x < 100` (`struct<x:int>`) | `GetStructField` |
   | `(a+b) > 0 AND (a+b) < 100` | `Add` |
   
   So:
   
   - The "3TB q28 ~40% slowdown" was **already fixed by #56209**'s `nonEmpty` 
gate. This PR does not change q28 codegen.
   - The new test claims to cover the leaf-only common-subexpression case, but 
the underlying `commonSubexpressions` set is actually empty there — it is 
exercising the same fallback branch as the existing `SPARK-56032: FilterExec 
skips CSE codegen when there is no common subexpression` test (lines 
1190–1226). Both old and new gate fall back, so the assertion holds vacuously.
   
   Suggested fixes:
   
   1. Rewrite the PR description / commit message to drop the q28 → leaf causal 
chain. The real motivation is: avoid taking the CSE path for cheap *non-leaf* 
expressions that `getCommonSubexpressions` can return — `GetStructField` is the 
realistic case (foldables almost always become `Literal` via `ConstantFolding`, 
and `Literal` is a leaf and therefore skipped).
   2. Replace the new test with a shape that actually exercises the new 
behavior, e.g. `s.x > 5 AND s.x < 100` over `struct<x:int>`. Pre-PR 
`getCommonSubexpressions = [GetStructField]` triggers the CSE path; post-PR it 
falls back.



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala:
##########
@@ -1225,6 +1225,48 @@ class WholeStageCodegenSuite extends SharedSparkSession
         "CSE-disabled codegen (i.e. fall back to the lazy, short-circuiting 
non-CSE path)")
   }
 
+  test("SPARK-56032: FilterExec skips CSE codegen when the only common 
subexpression is a leaf") {

Review Comment:
   Missing the dual test: when `otherPreds` contain a genuinely non-cheap 
common subexpression (e.g. `(a + b) > 0 AND (a + b) < 100`, where 
`getCommonSubexpressions = [Add(a, b)]` and `isCheap = false`), the gate should 
take the CSE path and the generated code should differ from the CSE-disabled 
version. The existing `FilterExec CSE` tests cover semantic correctness but do 
not pin down "when CSE *should* fire, it does." This PR tightens the gate by 
one notch; a dual assertion would prevent a future change from tightening it 
further (mistakenly classifying something as cheap).
   



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -1561,7 +1561,8 @@ object CollapseProject extends Rule[LogicalPlan] with 
AliasHelper {
    * Check if the given expression is cheap that we can inline it.
    */
   def isCheap(e: Expression): Boolean = e match {

Review Comment:
   Worth a one-line doc update on `isCheap`: it is now consumed both by 
logical-stage callers (matching `Attribute`, see `expressions.scala:883-889`, 
`Optimizer.scala:1353/1524`, `RewriteWithExpression.scala:124`) and by the 
`FilterExec` codegen gate (matching `BoundReference`). The new `BoundReference` 
branch only fires on the codegen path — logical plans never carry 
`BoundReference` — so existing callers are unaffected. Calling that out 
explicitly makes the cross-layer reuse easier to justify and removes ambiguity 
for future readers.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-56032][SQL][FOLLOWUP] Skip FilterExec CSE codegen when the only common subexpression is a leaf [spark]

Reply via email to