Dandandan commented on pull request #9639:
URL: https://github.com/apache/arrow/pull/9639#issuecomment-791905238
@alamb
> Ideally in my mind we would be able to run the optimizations twice (so we
could do it with the initial call to sql, but then if someone added more
grouping or repartitioning or something, we could run the optimizer passes again).
I removed the check and added a test verifying that projection pushdown
returns the same plan when run twice. I am not sure what the check was trying
to prevent? It seems to pass all the tests (which use sql + collect quite
often).
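To illustrate the property the new test checks, here is a minimal, self-contained sketch (none of these types are DataFusion's actual types): an optimizer pass is safe to run twice when it is idempotent, i.e. `optimize(optimize(p)) == optimize(p)`. The toy "projection pushdown"-style pass below reaches a fixpoint after one run.

```rust
// Toy plan type, for illustration only (not DataFusion's LogicalPlan).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { columns: Vec<String> },
    Projection { columns: Vec<String>, input: Box<Plan> },
}

// Collapse Projection(Projection(x)) into one Projection and push the
// projected columns down into the Scan.
fn optimize(plan: &Plan) -> Plan {
    match plan {
        Plan::Projection { columns, input } => match optimize(input) {
            // Nested projection: keep only the outer column list.
            Plan::Projection { input, .. } => Plan::Projection {
                columns: columns.clone(),
                input,
            },
            // Push the projected columns into the scan.
            Plan::Scan { .. } => Plan::Scan { columns: columns.clone() },
        },
        Plan::Scan { columns } => Plan::Scan { columns: columns.clone() },
    }
}

fn main() {
    let plan = Plan::Projection {
        columns: vec!["a".to_string()],
        input: Box::new(Plan::Projection {
            columns: vec!["a".to_string(), "b".to_string()],
            input: Box::new(Plan::Scan {
                columns: vec!["a".to_string(), "b".to_string(), "c".to_string()],
            }),
        }),
    };
    let once = optimize(&plan);
    let twice = optimize(&once);
    // Optimizing an already-optimized plan changes nothing.
    assert_eq!(once, twice);
    println!("{:?}", once);
}
```

If every pass has this fixpoint property, running the full optimizer on an already-optimized plan is wasteful but harmless, which is what makes removing the check viable.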
> @Dandandan something I have been thinking recently (as I prepared for my
talk next week on DataFusion as well as talking with @NGA-TRAN on my team at
Influx) was how similar the LogicalPlanBuilder and DataFrame APIs are (and in
fact the DataFrameImpl basically calls the functions on LogicalPlanBuilder).
> I almost wonder if we should combine the two somehow... I don't have a
concrete proposal now, just thinking.
Thanks. Yeah, `DataFrame` and `LogicalPlan` are pretty similar; I am not sure
whether there is anything to change about that, as `DataFrame` is just a
higher-level layer over `LogicalPlan`.
I think maybe some methods in `ExecutionContext` could be changed / deprecated
so users are nudged toward using `DataFrame`s more?
For example, the public function `create_logical_plan` has a comment saying
"This function is intended for internal use and should not be called
directly", but both the TPC-H benchmarks and the flight-server example operate
directly on the logical/physical plans, when they should probably use the
`sql`/`DataFrame::collect` API instead. Snippet from the flight_server example:
```rust
let plan = ctx
    .create_logical_plan(&sql)
    .and_then(|plan| ctx.optimize(&plan))
    .and_then(|plan| ctx.create_physical_plan(&plan))
    .map_err(|e| to_tonic_err(&e))?;

// execute the query
let results = collect(plan.clone()).await.map_err(|e| to_tonic_err(&e))?;
```
But this PR now runs the optimizer twice if you use `sql` + `.collect`.
I am not sure what the expected end result would be. I guess one could keep
some kind of flag marking a plan node as optimized, so that when it is the
root a full optimization is not run again, but maybe that's not worth it.
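As a rough sketch of that flag idea (purely hypothetical; none of these types or fields exist in DataFusion), the root of an optimized plan could carry a marker so a second `optimize` call becomes a no-op:

```rust
// Hypothetical plan wrapper: `optimized` records that the full optimizer
// has already run on this plan, so a repeated call can short-circuit.
#[derive(Debug, Clone, PartialEq)]
struct Plan {
    description: String,
    optimized: bool,
}

fn optimize(plan: &Plan) -> Plan {
    if plan.optimized {
        // Root was already optimized: skip the full set of passes.
        return plan.clone();
    }
    // Stand-in for actually running the optimizer passes.
    Plan {
        description: format!("optimized({})", plan.description),
        optimized: true,
    }
}

fn main() {
    let plan = Plan {
        description: "projection(scan)".to_string(),
        optimized: false,
    };
    let once = optimize(&plan);
    let twice = optimize(&once);
    // The second run is a no-op thanks to the flag.
    assert_eq!(once, twice);
}
```

The downside, as noted above, is the bookkeeping: any later mutation of the plan (more grouping, repartitioning, etc.) would have to clear the flag, which may cost more in complexity than the redundant optimizer run saves.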
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]