LuciferYang commented on code in PR #56604:
URL: https://github.com/apache/spark/pull/56604#discussion_r3441165030
##########
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala:
##########
@@ -1225,6 +1225,48 @@ class WholeStageCodegenSuite extends SharedSparkSession
"CSE-disabled codegen (i.e. fall back to the lazy, short-circuiting
non-CSE path)")
}
+ test("SPARK-56032: FilterExec skips CSE codegen when the only common
subexpression is a leaf") {
Review Comment:
Seems the motivation story and the new test case both rest on a premise that
does not hold:
*"`c BETWEEN lo AND hi` lowers to `c >= lo AND c <= hi`, so any `BETWEEN`
makes that column a common subexpression."*
Two reasons it does not:
1. `EquivalentExpressions.updateExprTree` skips `LeafExpression` outright
(`val skip = useCount == 0 || expr.isInstanceOf[LeafExpression]`).
`BoundReference` and `Attribute` are both `LeafExpression`, so a bare column
never enters the equivalence map regardless of how many times it appears.
2. `FilterExec.splitConjunctivePredicates` splits `q BETWEEN 2 AND 6` into
`q >= 2` and `q <= 6` as two independent conjuncts, fed to `addExprTree`
separately. Across separate `addExprTree` calls a leaf cannot be deduplicated.
I verified this by dumping
`otherPredsEquivalentExpressions.getCommonSubexpressions` on real plans, both
before and after this PR:
| Filter | `commonSubexpressions` (pre-PR == post-PR) |
| --- | --- |
| `q IS NOT NULL AND q BETWEEN 2 AND 6 AND (p1 BETWEEN 8 AND 18 OR p2
BETWEEN 5 AND 9)` (this PR's test) | empty |
| The actual q28 SQL (Int + Decimal columns) | empty |
| `s.x > 5 AND s.x < 100` (`struct<x:int>`) | `GetStructField` |
| `(a+b) > 0 AND (a+b) < 100` | `Add` |
So:
- The "3TB q28 ~40% slowdown" was **already fixed by #56209**'s `nonEmpty`
gate. This PR does not change q28 codegen.
- The new test claims to cover the leaf-only common-subexpression case, but
the underlying `commonSubexpressions` set is actually empty there — it is
exercising the same fallback branch as the existing `SPARK-56032: FilterExec
skips CSE codegen when there is no common subexpression` test (lines
1190–1226). Both old and new gate fall back, so the assertion holds vacuously.
Suggested fixes:
1. Rewrite the PR description / commit message to drop the q28 → leaf causal
chain. The real motivation is: avoid taking the CSE path for cheap *non-leaf*
expressions that `getCommonSubexpressions` can return — `GetStructField` is the
realistic case (foldables almost always become `Literal` via `ConstantFolding`,
and `Literal` is a leaf and therefore skipped).
2. Replace the new test with a shape that actually exercises the new
behavior, e.g. `s.x > 5 AND s.x < 100` over `struct<x:int>`. Pre-PR
`getCommonSubexpressions = [GetStructField]` triggers the CSE path; post-PR it
falls back.
##########
sql/core/src/test/scala/org/apache/spark/sql/execution/WholeStageCodegenSuite.scala:
##########
@@ -1225,6 +1225,48 @@ class WholeStageCodegenSuite extends SharedSparkSession
"CSE-disabled codegen (i.e. fall back to the lazy, short-circuiting
non-CSE path)")
}
+ test("SPARK-56032: FilterExec skips CSE codegen when the only common
subexpression is a leaf") {
Review Comment:
Missing the dual test: when `otherPreds` contain a genuinely non-cheap
common subexpression (e.g. `(a + b) > 0 AND (a + b) < 100`, where
`getCommonSubexpressions = [Add(a, b)]` and `isCheap = false`), the gate should
take the CSE path and the generated code should differ from the CSE-disabled
version. The existing `FilterExec CSE` tests cover semantic correctness but do
not pin down "when CSE *should* fire, it does." This PR tightens the gate by
one notch; a dual assertion would prevent a future change from tightening it
further (mistakenly classifying something as cheap).
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -1561,7 +1561,8 @@ object CollapseProject extends Rule[LogicalPlan] with
AliasHelper {
* Check if the given expression is cheap that we can inline it.
*/
def isCheap(e: Expression): Boolean = e match {
Review Comment:
Worth a one-line doc update on `isCheap`: it is now consumed both by
logical-stage callers (matching `Attribute`, see `expressions.scala:883-889`,
`Optimizer.scala:1353/1524`, `RewriteWithExpression.scala:124`) and by the
`FilterExec` codegen gate (matching `BoundReference`). The new `BoundReference`
branch only fires on the codegen path — logical plans never carry
`BoundReference` — so existing callers are unaffected. Calling that out
explicitly makes the cross-layer reuse easier to justify and removes ambiguity
for future readers.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]