andygrove commented on issue #942:
URL: https://github.com/apache/datafusion-comet/issues/942#issuecomment-3885950423

   ## Notes from CSE optimizer implementation attempt (PR #3444)
   
   > **Note:** This summary was generated by AI (Claude) based on the issue 
history, PR #3444, and related upstream discussions. It may contain 
inaccuracies.
   
   Closing PR #3444 since the benchmarks did not show meaningful improvement, 
but documenting learnings here for anyone who wants to revisit this in the 
future.
   
   ### What was implemented
   
   A Rust-based physical CSE optimizer rule (`physical_cse.rs`, ~611 lines) 
that:
   - Runs as a DataFusion `PhysicalOptimizerRule` on the native plan before 
execution
   - Walks the plan bottom-up, looking for repeated subexpressions within 
`ProjectionExec` and `AggregateExec` nodes
   - When duplicates are found, inserts an intermediate `ProjectionExec` that 
computes the common subexpression once, then rewrites the parent node to 
reference the pre-computed column
   - Skips trivial expressions (columns, literals) and volatile expressions
   - When nested common subexpressions exist, only extracts the outermost one 
to avoid unused intermediate columns
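   The bottom-up extraction described above can be sketched with a toy expression tree (Python here for brevity; the real rule operates on DataFusion `PhysicalExpr` trees in Rust, and all names below are illustrative, not taken from the PR):

```python
# Toy sketch of bottom-up CSE detection over an expression tree.
# Expressions are nested tuples: ("mul", ("col", "l_extendedprice"), ...).
from collections import Counter

def is_trivial(expr):
    # Columns and literals are skipped, as the PR does.
    return expr[0] in ("col", "lit")

def count_subexprs(expr, counts):
    if is_trivial(expr):
        return
    counts[expr] += 1
    for child in expr[1:]:
        if isinstance(child, tuple):
            count_subexprs(child, counts)

def extract_outermost(expr, dups, found):
    """Collect only the outermost duplicated subexpressions, so nested
    duplicates do not produce unused intermediate columns."""
    if isinstance(expr, tuple) and expr in dups:
        found.append(expr)
        return  # do not descend: inner duplicates are covered by this one
    if isinstance(expr, tuple):
        for child in expr[1:]:
            extract_outermost(child, dups, found)

def common_subexprs(projection):
    counts = Counter()
    for e in projection:
        count_subexprs(e, counts)
    dups = {e for e, n in counts.items() if n > 1}
    found = []
    for e in projection:
        extract_outermost(e, dups, found)
    seen, result = set(), []          # de-duplicate, preserving order
    for e in found:
        if e not in seen:
            seen.add(e)
            result.append(e)
    return result

# TPC-H Q1 shape: l_extendedprice * (1 - l_discount) appears twice.
disc_price = ("mul", ("col", "l_extendedprice"),
              ("sub", ("lit", 1), ("col", "l_discount")))
proj = [disc_price,
        ("mul", disc_price, ("add", ("lit", 1), ("col", "l_tax")))]
print(common_subexprs(proj))  # the shared disc_price expression, once
```

   Note that the nested `1 - l_discount` is also duplicated, but only the outermost `disc_price` is extracted, matching the "outermost only" behavior above.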
   
   The PR also added:
   - A config `spark.comet.exec.nativePhysicalOptimizer.enabled` to toggle the 
optimizer
   - A mechanism to pass DataFusion configs from Spark via 
`spark.comet.datafusion.*` prefix
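   The passthrough mechanism amounts to a prefix-stripping map from Spark conf keys to DataFusion config keys. A minimal sketch (the mapping logic is inferred from the description above; the example key is a real DataFusion option, but treat specifics as assumptions):

```python
# Hypothetical sketch of the spark.comet.datafusion.* passthrough:
# Spark conf entries carrying the prefix are forwarded to DataFusion's
# session config with the prefix stripped.
PREFIX = "spark.comet.datafusion."

def datafusion_options(spark_conf: dict) -> dict:
    """Extract DataFusion config entries from a Spark conf map."""
    return {k[len(PREFIX):]: v
            for k, v in spark_conf.items()
            if k.startswith(PREFIX)}

conf = {
    "spark.comet.exec.enabled": "true",
    "spark.comet.datafusion.execution.batch_size": "8192",
}
print(datafusion_options(conf))  # {'execution.batch_size': '8192'}
```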
   
   ### What the benchmarks showed
   
   The rule correctly identified the common subexpression in TPC-H Q1 
(`l_extendedprice * (1 - l_discount)`), but the automated TPC-H SF100 benchmark 
showed **0.0% total improvement** across all 22 queries. Q1 specifically showed 
+0.8% (within noise). This is surprising given that the original manual SQL 
rewrite in this issue showed ~10-12% improvement on Q1.
   
   Possible explanations for the gap:
   - The manual rewrite may have benefited from other Spark optimizer effects 
(e.g., different plan shape)
   - The overhead of the extra `ProjectionExec` node (memory allocation, column 
copying) may offset the savings from avoiding redundant computation
   - The benchmark environment (SF100 on a single machine with `local[*]`) may 
behave differently than the original test setup
   - CPU instruction caches and vectorization effects may make redundant 
evaluation cheaper than the extra projection indirection
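   The second point, that materializing a shared column may not beat recomputing cheap arithmetic, can be seen in miniature with vectorized arrays (NumPy standing in for Arrow compute kernels; this checks equivalence of the two plans, not their relative speed):

```python
# Toy illustration of the cost-benefit tradeoff: for cheap arithmetic,
# a materialized intermediate buys little over just recomputing it.
import numpy as np

rng = np.random.default_rng(0)
price = rng.random(1_000_000)
discount = rng.random(1_000_000)
tax = rng.random(1_000_000)

# Without CSE: the common factor is evaluated twice.
sum_disc = (price * (1 - discount)).sum()
sum_charge = (price * (1 - discount) * (1 + tax)).sum()

# With CSE: one extra materialized array; each use reads it back.
disc_price = price * (1 - discount)   # the inserted projection column
sum_disc_cse = disc_price.sum()
sum_charge_cse = (disc_price * (1 + tax)).sum()

assert np.isclose(sum_disc, sum_disc_cse)
assert np.isclose(sum_charge, sum_charge_cse)
```

   Both plans produce identical results; whether the CSE variant is faster depends on memory bandwidth versus ALU cost, which is exactly the tradeoff the benchmark results suggest went the wrong way for Q1's simple arithmetic.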
   
   ### Related upstream work
   
   - DataFusion issue: https://github.com/apache/datafusion/issues/12599 (still 
open)
   - DataFusion PR: https://github.com/apache/datafusion/pull/13046 
(closed/stale) — this was a more general approach by @peter-toth for CSE on 
`PhysicalExpr` trees. It was closed because: *"the main usecase for physical 
plan CSE is Comet, I will reopen this PR once we figured out the metrics 
related issues due to extra projections added by CSE."*
   
   ### Key challenges for future attempts
   
   1. **Metrics tracking**: Extra `ProjectionExec` nodes inserted by CSE break 
the 1:1 mapping between Spark plan nodes and native plan nodes, which 
complicates metrics reporting back to Spark. This was also the blocker for the 
upstream DataFusion PR.
   2. **Cost-benefit tradeoff**: For simple arithmetic (multiply, subtract), 
the cost of adding a projection (allocating arrays, copying data) can exceed 
the cost of just recomputing the expression. CSE may only pay off for expensive 
subexpressions (UDFs, complex string operations, regex, etc.).
   3. **Spark already does codegen-level CSE**: As @eejbyfeldt [noted](https://github.com/apache/datafusion-comet/issues/942#issuecomment-2374777297), Spark handles this at the codegen level ([CodeGenerator.scala](https://github.com/apache/spark/blob/v3.5.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L1064-L1098)), not at the plan level. A plan-level approach has inherently more overhead.
   
   ### Recommendations
   
   - If revisiting, consider targeting only expensive subexpressions rather 
than all duplicates
   - Consider implementing CSE at the expression evaluation level (like Spark 
does) rather than at the plan level, to avoid the overhead of extra projection 
nodes
   - The `spark.comet.datafusion.*` config passthrough mechanism from this PR 
may be worth landing independently, as it's useful beyond CSE
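   The first recommendation could take the form of a simple cost gate on extraction. A hypothetical sketch (the op names, cost numbers, and threshold are all made up for illustration, not from the PR):

```python
# Hypothetical cost heuristic for "extract only expensive subexpressions":
# a duplicate is only worth an extra projection if recomputing it is costly.
EXPENSIVE_OPS = {"regexp_match": 100, "udf": 100, "substring": 10}
CHEAP_OPS = {"mul": 1, "add": 1, "sub": 1}

def expr_cost(expr):
    if expr[0] in ("col", "lit"):
        return 0
    op_cost = EXPENSIVE_OPS.get(expr[0], CHEAP_OPS.get(expr[0], 1))
    return op_cost + sum(expr_cost(c) for c in expr[1:]
                         if isinstance(c, tuple))

def worth_extracting(expr, threshold=10):
    return expr_cost(expr) >= threshold

arith = ("mul", ("col", "p"), ("sub", ("lit", 1), ("col", "d")))
regex = ("regexp_match", ("col", "s"), ("lit", "a+b"))
print(worth_extracting(arith), worth_extracting(regex))  # False True
```

   Under such a gate, Q1's cheap arithmetic would be left alone while regex or UDF duplicates would still be hoisted, which matches the cost-benefit observation from the benchmarks.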

