andygrove opened a new pull request, #3220:
URL: https://github.com/apache/datafusion-comet/pull/3220
## Summary
This PR introduces an **experimental** lightweight cost-based optimizer
(CBO) that estimates whether a Comet query plan will be faster than a Spark
plan, falling back to Spark when Comet execution is estimated to be slower.
> **⚠️ EXPERIMENTAL**: This feature is disabled by default and should be
considered experimental. The cost model parameters are initial estimates and
should be tuned with real-world benchmarks before being used in production.
## Key Features
- **Heuristic-based cost model** with configurable weights for different
operator types:
  - Scan, Filter, Project, Aggregate, Join, Sort
- **Configurable speedup factors** for each Comet operator type
- **Transition penalty** for columnar↔row conversions
- **Cardinality estimation** using Spark's logical plan statistics with
fallbacks
- **CBO analysis in EXPLAIN output** when enabled
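
The sketch below illustrates the general shape of such a heuristic cost model: per-operator weights, per-operator Comet speedup factors, and a transition penalty. All names and values are hypothetical placeholders rather than the defaults shipped in this PR (see `CometCostEstimator.scala` and `CometConf.scala` for the real parameters):

```scala
// Hypothetical sketch of a weight-based cost model. The operator names,
// weights, and speedup values below are illustrative only.
object CostModelSketch {

  // Relative per-row processing cost for each operator type in Spark.
  val operatorWeights: Map[String, Double] = Map(
    "scan" -> 1.0,
    "filter" -> 0.5,
    "project" -> 0.5,
    "aggregate" -> 2.0,
    "join" -> 3.0,
    "sort" -> 2.5)

  // Assumed speedup when the operator runs natively in Comet.
  val cometSpeedups: Map[String, Double] = Map(
    "filter" -> 2.0,
    "project" -> 2.0,
    "aggregate" -> 3.0,
    "join" -> 2.5,
    "sort" -> 3.0)

  // Fixed cost charged for every columnar <-> row transition in the plan.
  val transitionPenalty: Double = 1000.0

  // Estimated cost of running one operator over `rows` rows on Spark.
  def sparkOperatorCost(op: String, rows: Long): Double =
    operatorWeights.getOrElse(op, 1.0) * rows

  // Estimated cost of the same operator when executed by Comet.
  def cometOperatorCost(op: String, rows: Long): Double =
    sparkOperatorCost(op, rows) / cometSpeedups.getOrElse(op, 1.0)
}
```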
## Configuration Options
| Config | Default | Description |
|--------|---------|-------------|
| `spark.comet.cbo.enabled` | `false` | Enable/disable CBO |
| `spark.comet.cbo.speedupThreshold` | `1.0` | Minimum estimated speedup required to use Comet |
| `spark.comet.cbo.explain.enabled` | `false` | Log CBO decision details |

Additional internal configs for tuning weights and speedup factors are available (see `CometConf.scala`).
## How It Works
1. After `CometExecRule` transforms operators to Comet equivalents, CBO
analyzes the plan
2. Collects plan statistics: counts of Comet operators, Spark operators, and columnar↔row transitions
3. Estimates costs for both Spark-only and Comet execution
4. Calculates estimated speedup = sparkCost / cometCost
5. If speedup < threshold, falls back to original Spark plan
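
A minimal sketch of the decision in steps 3–5, using hypothetical names (the actual integration lives in `CometExecRule.scala` and `CometCostEstimator.scala`):

```scala
// Hypothetical sketch of the CBO fallback decision; names and signature are
// illustrative only.
def choosePlan[Plan](sparkPlan: Plan,
                     cometPlan: Plan,
                     estimatedSparkCost: Double,
                     estimatedCometCost: Double,
                     speedupThreshold: Double): Plan = {
  // Step 4: estimated speedup of the Comet plan over the Spark plan.
  val speedup = estimatedSparkCost / estimatedCometCost
  // Step 5: fall back to the original Spark plan when the estimated speedup
  // is below spark.comet.cbo.speedupThreshold.
  if (speedup < speedupThreshold) sparkPlan else cometPlan
}
```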
## Limitations
- **CBO only affects operator conversion** (filter, project, aggregate,
join, sort)
- **Scan conversion is NOT affected** - handled separately by `CometScanRule`
- Cost model parameters are estimates and need tuning with benchmarks
## Example Usage
```scala
// Enable CBO with default threshold
spark.conf.set("spark.comet.cbo.enabled", "true")
// Or with a custom threshold (use Comet only when the estimated speedup is at least 1.5x)
spark.conf.set("spark.comet.cbo.speedupThreshold", "1.5")
// Enable debug logging
spark.conf.set("spark.comet.cbo.explain.enabled", "true")
```
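
With `spark.comet.cbo.explain.enabled` set, the CBO decision details are logged and surfaced through Comet's extended explain info (`ExtendedExplainInfo.scala`). The snippet below uses a placeholder query; the exact location and format of the CBO output are not shown here:

```scala
// Placeholder query; with the configs above enabled, the CBO analysis is
// logged and included in Comet's extended explain information.
val df = spark.sql("SELECT c1, COUNT(*) AS cnt FROM t GROUP BY c1")
df.explain("extended")
```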
## Files Changed
- **New**: `CometCostEstimator.scala` - Core cost estimation logic
- **New**: `CometCBOSuite.scala` - Unit tests
- **Modified**: `CometConf.scala` - Configuration options
- **Modified**: `CometExecRule.scala` - CBO integration
- **Modified**: `ExtendedExplainInfo.scala` - CBO info in EXPLAIN
## Test plan
- [x] New unit tests in `CometCBOSuite` (13 tests)
- [x] Existing `CometExecRuleSuite` tests pass
- [ ] Manual testing with TPC-H/TPC-DS benchmarks (future work)
- [ ] Tune default parameters based on benchmark results (future work)
🤖 Generated with [Claude Code](https://claude.ai/code)