andygrove commented on issue #942: URL: https://github.com/apache/datafusion-comet/issues/942#issuecomment-3885950423
## Notes from CSE optimizer implementation attempt (PR #3444) > **Note:** This summary was generated by AI (Claude) based on the issue history, PR #3444, and related upstream discussions. It may contain inaccuracies. Closing PR #3444 since the benchmarks did not show meaningful improvement, but documenting learnings here for anyone who wants to revisit this in the future. ### What was implemented A Rust-based physical CSE optimizer rule (`physical_cse.rs`, ~611 lines) that: - Runs as a DataFusion `PhysicalOptimizerRule` on the native plan before execution - Walks the plan bottom-up, looking for repeated subexpressions within `ProjectionExec` and `AggregateExec` nodes - When duplicates are found, inserts an intermediate `ProjectionExec` that computes the common subexpression once, then rewrites the parent node to reference the pre-computed column - Skips trivial expressions (columns, literals) and volatile expressions - When nested common subexpressions exist, only extracts the outermost one to avoid unused intermediate columns The PR also added: - A config `spark.comet.exec.nativePhysicalOptimizer.enabled` to toggle the optimizer - A mechanism to pass DataFusion configs from Spark via `spark.comet.datafusion.*` prefix ### What the benchmarks showed The rule correctly identified the common subexpression in TPC-H Q1 (`l_extendedprice * (1 - l_discount)`), but the automated TPC-H SF100 benchmark showed **0.0% total improvement** across all 22 queries. Q1 specifically showed +0.8% (within noise). This is surprising given that the original manual SQL rewrite in this issue showed ~10-12% improvement on Q1. Possible explanations for the gap: - The manual rewrite may have benefited from other Spark optimizer effects (e.g., different plan shape) - The overhead of the extra `ProjectionExec` node (memory allocation, column copying) may offset the savings from avoiding redundant computation - The benchmark environment (SF100 on a single machine with `local[*]`) may behave differently than the original test setup - CPU instruction caches and vectorization effects may make redundant evaluation cheaper than the extra projection indirection ### Related upstream work - DataFusion issue: https://github.com/apache/datafusion/issues/12599 (still open) - DataFusion PR: https://github.com/apache/datafusion/pull/13046 (closed/stale) — this was a more general approach by @peter-toth for CSE on `PhysicalExpr` trees. It was closed because: *"the main usecase for physical plan CSE is Comet, I will reopen this PR once we figured out the metrics related issues due to extra projections added by CSE."* ### Key challenges for future attempts 1. **Metrics tracking**: Extra `ProjectionExec` nodes inserted by CSE break the 1:1 mapping between Spark plan nodes and native plan nodes, which complicates metrics reporting back to Spark. This was also the blocker for the upstream DataFusion PR. 2. **Cost-benefit tradeoff**: For simple arithmetic (multiply, subtract), the cost of adding a projection (allocating arrays, copying data) can exceed the cost of just recomputing the expression. CSE may only pay off for expensive subexpressions (UDFs, complex string operations, regex, etc.). 3. **Spark already does codegen-level CSE**: As @eejbyfeldt [noted](https://github.com/apache/datafusion-comet/issues/942#issuecomment-2374777297), Spark handles this at the codegen level ([CodeGenerator.scala](https://github.com/apache/spark/blob/v3.5.3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala#L1064-L1098)), not at the plan level. A plan-level approach has inherently more overhead. ### Recommendations - If revisiting, consider targeting only expensive subexpressions rather than all duplicates - Consider implementing CSE at the expression evaluation level (like Spark does) rather than at the plan level, to avoid the overhead of extra projection nodes - The `spark.comet.datafusion.*` config passthrough mechanism from this PR may be worth landing independently, as it's useful beyond CSE -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
