andygrove opened a new pull request, #3220:
URL: https://github.com/apache/datafusion-comet/pull/3220

   ## Summary
   
   This PR introduces an **experimental** lightweight cost-based optimizer 
(CBO) that estimates whether a Comet query plan will be faster than a Spark 
plan, falling back to Spark when Comet execution is estimated to be slower.
   
   > **⚠️ EXPERIMENTAL**: This feature is disabled by default and should be 
considered experimental. The cost model parameters are initial estimates and 
should be tuned with real-world benchmarks before being used in production.
   
   ## Key Features
   
   - **Heuristic-based cost model** with configurable weights for different operator types (see the sketch after this list):
     - Scan, Filter, Project, Aggregate, Join, Sort
   - **Configurable speedup factors** for each Comet operator type
   - **Transition penalty** for columnar↔row conversions
   - **Cardinality estimation** using Spark's logical plan statistics, with fallbacks when statistics are not available
   - **CBO analysis in EXPLAIN output** when enabled
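
   As a rough illustration of how these pieces could fit together, here is a hypothetical sketch of a heuristic cost model: per-operator weights scaled by estimated row counts, per-operator Comet speedup factors, and a fixed penalty per columnar↔row transition. All names and numbers below are illustrative assumptions, not the actual `CometCostEstimator` implementation or the defaults in `CometConf.scala`.

   ```scala
   // Hypothetical sketch only - not the actual CometCostEstimator logic.
   object CostModelSketch {

     // Relative per-row weight of an operator and the speedup assumed when it
     // runs in Comet (both configurable in the real cost model).
     case class OperatorProfile(weight: Double, cometSpeedup: Double)

     // Illustrative values; the real defaults live in CometConf.scala.
     val profiles: Map[String, OperatorProfile] = Map(
       "Scan"      -> OperatorProfile(weight = 1.0, cometSpeedup = 2.0),
       "Filter"    -> OperatorProfile(weight = 0.5, cometSpeedup = 2.0),
       "Project"   -> OperatorProfile(weight = 0.5, cometSpeedup = 1.5),
       "Aggregate" -> OperatorProfile(weight = 3.0, cometSpeedup = 2.5),
       "Join"      -> OperatorProfile(weight = 4.0, cometSpeedup = 2.0),
       "Sort"      -> OperatorProfile(weight = 3.0, cometSpeedup = 2.0))

     // Fixed cost charged for each columnar<->row transition in the Comet plan.
     val transitionPenalty: Double = 1000.0

     // A plan node reduced to what the cost model needs: operator type,
     // estimated row count (from logical plan statistics, with a fallback when
     // statistics are unavailable), and whether it was converted to Comet.
     case class PlanNode(op: String, estimatedRows: Long, runsInComet: Boolean)

     def sparkCost(nodes: Seq[PlanNode]): Double =
       nodes.map(n => profiles(n.op).weight * n.estimatedRows).sum

     def cometCost(nodes: Seq[PlanNode], numTransitions: Int): Double = {
       val operatorCost = nodes.map { n =>
         val p = profiles(n.op)
         val base = p.weight * n.estimatedRows
         if (n.runsInComet) base / p.cometSpeedup else base
       }.sum
       operatorCost + numTransitions * transitionPenalty
     }
   }
   ```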
   
   ## Configuration Options
   
   | Config | Default | Description |
   |--------|---------|-------------|
   | `spark.comet.cbo.enabled` | `false` | Enable/disable CBO |
   | `spark.comet.cbo.speedupThreshold` | `1.0` | Minimum estimated speedup 
required to use Comet |
   | `spark.comet.cbo.explain.enabled` | `false` | Log CBO decision details |
   
   Additional internal configs for tuning weights and speedup factors are 
available (see `CometConf.scala`).
   
   ## How It Works
   
   1. After `CometExecRule` transforms operators to their Comet equivalents, CBO analyzes the resulting plan
   2. It collects plan statistics: Comet operators, Spark operators, and columnar↔row transitions
   3. It estimates costs for both Spark-only and Comet execution
   4. It calculates the estimated speedup as `sparkCost / cometCost`
   5. If the estimated speedup is below the threshold, Comet falls back to the original Spark plan (see the sketch after this list)
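
   A minimal sketch of the decision in steps 3-5, using hypothetical names (this is not the actual `CometExecRule` integration):

   ```scala
   // Hypothetical sketch of the fallback decision; names are illustrative.
   case class CboDecision(sparkCost: Double, cometCost: Double, threshold: Double) {
     // Estimated speedup of the (partially) Comet plan over the Spark-only plan.
     val speedup: Double = sparkCost / cometCost
     // Keep the Comet plan only if the estimated speedup meets the threshold
     // (spark.comet.cbo.speedupThreshold, default 1.0).
     val useComet: Boolean = speedup >= threshold
   }

   // Example: Spark-only cost 9000, Comet cost 5000 -> speedup 1.8, so the
   // Comet plan is kept under the default threshold of 1.0.
   val decision = CboDecision(sparkCost = 9000.0, cometCost = 5000.0, threshold = 1.0)
   ```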
   
   ## Limitations
   
   - **CBO only affects operator conversion** (filter, project, aggregate, 
join, sort)
   - **Scan conversion is NOT affected** - handled separately by `CometScanRule`
   - Cost model parameters are estimates and need tuning with benchmarks
   
   ## Example Usage
   
   ```scala
   // Enable CBO with default threshold
   spark.conf.set("spark.comet.cbo.enabled", "true")
   
   // Or with custom threshold (use Comet only if 1.5x faster)
   spark.conf.set("spark.comet.cbo.speedupThreshold", "1.5")
   
   // Enable debug logging
   spark.conf.set("spark.comet.cbo.explain.enabled", "true")
   ```
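
   When `spark.comet.cbo.explain.enabled` is set, the CBO decision details are surfaced through Comet's extended EXPLAIN support. A minimal sketch of viewing them, assuming the `spark.sql.extendedExplainProviders` config and Comet's `org.apache.comet.ExtendedExplainInfo` provider class (verify both against your Spark and Comet versions; the provider is often registered with `--conf` when the session is created):

   ```scala
   // Assumed provider registration; check the Comet docs for your version.
   spark.conf.set("spark.sql.extendedExplainProviders", "org.apache.comet.ExtendedExplainInfo")

   // Any query; the path is illustrative.
   val df = spark.read.parquet("/path/to/data").filter("value > 0").groupBy("category").count()

   // Extended explain output includes the CBO analysis when enabled.
   df.explain("extended")
   ```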
   
   ## Files Changed
   
   - **New**: `CometCostEstimator.scala` - Core cost estimation logic
   - **New**: `CometCBOSuite.scala` - Unit tests
   - **Modified**: `CometConf.scala` - Configuration options
   - **Modified**: `CometExecRule.scala` - CBO integration
   - **Modified**: `ExtendedExplainInfo.scala` - CBO info in EXPLAIN
   
   ## Test plan
   
   - [x] New unit tests in `CometCBOSuite` (13 tests)
   - [x] Existing `CometExecRuleSuite` tests pass
   - [ ] Manual testing with TPC-H/TPC-DS benchmarks (future work)
   - [ ] Tune default parameters based on benchmark results (future work)
   
   🤖 Generated with [Claude Code](https://claude.ai/code)

