andygrove opened a new issue, #833:
URL: https://github.com/apache/datafusion-comet/issues/833
### What is the problem the feature request solves?
I recently worked on a few experimental PRs around cost-based optimization (CBO) that I am going to
close because they did not provide an immediate benefit, but I want to
document the work, so I am using this issue to do that.
## Using DataFusion's Physical Optimizer
DataFusion provides a physical optimizer, and there may be benefit in the
future from applying DataFusion's rules or custom Comet rules. For example,
injecting `CopyExec` into the plan would be a good fit for an optimizer rule.
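As a rough illustration, such a Comet rule could implement DataFusion's `PhysicalOptimizerRule` trait and then be registered via the builder shown in the next snippet. This is only a minimal sketch: the name `InsertCopyExec` is hypothetical and the actual rewrite logic is elided.
```rust
use std::sync::Arc;

use datafusion::config::ConfigOptions;
use datafusion::error::Result;
use datafusion::physical_optimizer::PhysicalOptimizerRule;
use datafusion::physical_plan::ExecutionPlan;

/// Hypothetical Comet rule that would insert CopyExec wherever the plan
/// requires owned batches (the traversal/rewrite logic is omitted here).
#[derive(Debug)]
struct InsertCopyExec;

impl PhysicalOptimizerRule for InsertCopyExec {
    fn optimize(
        &self,
        plan: Arc<dyn ExecutionPlan>,
        _config: &ConfigOptions,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // placeholder: a real implementation would walk the plan and wrap
        // the relevant children in CopyExec
        Ok(plan)
    }

    fn name(&self) -> &str {
        "insert_copy_exec"
    }

    fn schema_check(&self) -> bool {
        true
    }
}
```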
In `jni_api.rs` we would need to add the rules that we want to enable:
```rust
let state = SessionStateBuilder::new()
    .with_config(session_config)
    .with_runtime_env(Arc::new(runtime))
    .with_default_features()
    .with_physical_optimizer_rules(vec![Arc::new(TopKAggregation::new())])
    .build();
```
Then in `planner.rs` we could add the call to optimize the plan:
```rust
pub fn optimize_plan(
    &self,
    plan: Arc<dyn ExecutionPlan>,
) -> Result<Arc<dyn ExecutionPlan>, ExecutionError> {
    // optimize the physical plan
    let datafusion_planner = DefaultPhysicalPlanner::default();
    datafusion_planner
        .optimize_physical_plan(plan, &self.session_ctx.state(), |_, _| {})
        .map_err(|e| e.into())
}
```
Because we receive a plan that is already optimized by Spark, there is no
immediate benefit in enabling the current rules from DataFusion.
## Passing statistics down to the native plan
It is possible for us to pass Spark statistics down to the native plan. For
example, we could add this to `QueryPlanSerde.scala`:
```scala
op match {
  case qs: QueryStageExec =>
    qs.computeStats() match {
      case Some(stats) =>
        val statsBuilder = OperatorOuterClass.Statistics.newBuilder()
        stats.rowCount.foreach(c => statsBuilder.setRowCount(c.toFloat))
        statsBuilder.setSizeInBytes(stats.sizeInBytes.toFloat)
        scanBuilder.setStatistics(statsBuilder.build())
      case _ =>
    }
  case _ =>
}
```
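On the native side, these values could then be surfaced as DataFusion statistics so that future optimizer rules can make use of them. Below is a minimal sketch, assuming the row count and size in bytes have already been read from the serialized plan; the helper name is hypothetical, and Spark's estimates are marked as inexact since they are only approximations.
```rust
use datafusion::common::stats::{ColumnStatistics, Precision, Statistics};

/// Hypothetical conversion from the proposed protobuf statistics
/// (row count and size in bytes, as serialized by the Scala snippet above)
/// into DataFusion's `Statistics` struct.
fn to_datafusion_statistics(
    row_count: Option<f32>,
    size_in_bytes: f32,
    num_columns: usize,
) -> Statistics {
    Statistics {
        // Spark's estimates are approximate, so report them as inexact
        num_rows: row_count
            .map(|c| Precision::Inexact(c as usize))
            .unwrap_or(Precision::Absent),
        total_byte_size: Precision::Inexact(size_in_bytes as usize),
        // no per-column statistics are available from this message
        column_statistics: vec![ColumnStatistics::new_unknown(); num_columns],
    }
}
```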
It is also possible to get the size in bytes from any Hadoop input relation and
infer a row count based on the schema (see the sketch below).
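A rough sketch of that estimate, where `relation` is assumed to be a `HadoopFsRelation` (or any `BaseRelation`) and the names are illustrative:
```scala
// Hypothetical estimate: derive a row count from the relation's size on disk
// and the schema's default per-row size (both are coarse approximations).
val sizeInBytes: Long = relation.sizeInBytes
val estimatedRowSize: Long = math.max(relation.schema.defaultSize, 1).toLong
val estimatedRowCount: Long = sizeInBytes / estimatedRowSize
```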
There is no value in doing this, though, until we have optimizer rules that
can make use of this data.
## Cost-model to determine when to fall back to Spark
We could implement a cost model that captures the relative costs of Comet vs Spark
operators and expressions, and fall back to Spark when we estimate
that Comet would be more expensive.
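A minimal sketch of what such a model might look like on the Spark side, where the fallback decision would be made; the `CometCostModel` object, the operator names, and the cost values are purely illustrative assumptions, not measurements:
```scala
// Purely illustrative relative costs (Comet cost / Spark cost for the same
// operator); real values would have to come from benchmarks.
object CometCostModel {
  private val relativeCost: Map[String, Double] = Map(
    "ProjectExec" -> 0.5,
    "FilterExec" -> 0.5,
    "SortExec" -> 0.8,
    "HashAggregateExec" -> 0.7)

  /** Returns true if running the given operators in Comet is estimated to be cheaper. */
  def cometIsCheaper(operatorNames: Seq[String]): Boolean = {
    // operators with no estimate are assumed to cost the same as Spark (1.0)
    val costs = operatorNames.map(relativeCost.getOrElse(_, 1.0))
    costs.sum < operatorNames.size
  }
}
```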
### Describe the potential solution
_No response_
### Additional context
_No response_