Dandandan commented on pull request #9639:
URL: https://github.com/apache/arrow/pull/9639#issuecomment-791905238
@alamb
> Ideally in my mind we would be able to run the optimizations twice (so we
could do it with the initial call to sql, but then if someone added more
grouping or repartitioning or something, we could run the optimizer passes again).
I removed the check and added a test verifying that projection pushdown
returns the same plan when run twice. I am not sure what the check was trying
to prevent? It seems to pass all the tests (which use sql + collect quite
often).
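To illustrate the property the new test checks, here is a minimal, self-contained sketch (none of these types are DataFusion's actual types): an optimizer pass is safe to run twice when it is idempotent, i.e. `optimize(optimize(p)) == optimize(p)`. The toy "projection pushdown"-style pass below reaches a fixpoint after one run.

```rust
// Toy plan type, for illustration only (not DataFusion's LogicalPlan).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { columns: Vec<String> },
    Projection { columns: Vec<String>, input: Box<Plan> },
}

// Collapse Projection(Projection(x)) into one Projection and push the
// projected columns down into the Scan.
fn optimize(plan: &Plan) -> Plan {
    match plan {
        Plan::Projection { columns, input } => match optimize(input) {
            // Nested projection: keep only the outer column list.
            Plan::Projection { input, .. } => Plan::Projection {
                columns: columns.clone(),
                input,
            },
            // Push the projected columns into the scan.
            Plan::Scan { .. } => Plan::Scan { columns: columns.clone() },
        },
        Plan::Scan { columns } => Plan::Scan { columns: columns.clone() },
    }
}

fn main() {
    let plan = Plan::Projection {
        columns: vec!["a".to_string()],
        input: Box::new(Plan::Projection {
            columns: vec!["a".to_string(), "b".to_string()],
            input: Box::new(Plan::Scan {
                columns: vec!["a".to_string(), "b".to_string(), "c".to_string()],
            }),
        }),
    };
    let once = optimize(&plan);
    let twice = optimize(&once);
    // Optimizing an already-optimized plan changes nothing.
    assert_eq!(once, twice);
    println!("{:?}", once);
}
```

If every pass has this fixpoint property, running the full optimizer on an already-optimized plan is wasteful but harmless, which is what makes removing the check viable.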
> @Dandandan something I have been thinking recently (as I prepared for my
talk next week on DataFusion as well as talking with @NGA-TRAN on my team at
Influx) was how similar the LogicalPlanBuilder and DataFrame APIs are (and in
fact the DataFrameImpl basically calls the functions on LogicalPlanBuilder).
> I almost wonder if we should combine the two somehow... I don't have a
concrete proposal now, just thinking.
Thanks. Yeah, `DataFrame` and `LogicalPlan` are pretty similar; I am not sure
whether there is anything to change about that, as `DataFrame` is just a
higher-level layer over `LogicalPlan`.
I think maybe some methods in `ExecutionContext` could be changed / deprecated
so users are nudged toward using `DataFrame`s more?
For example, the public function `create_logical_plan` has a comment saying
"This function is intended for internal use and should not be called
directly", but both the TPC-H benchmarks and the flight-server example operate
directly on the logical/physical plans, when they should probably use the
`sql`/`DataFrame::collect` API instead. Snippet from the flight_server example:
```rust
let plan = ctx
    .create_logical_plan(&sql)
    .and_then(|plan| ctx.optimize(&plan))
    .and_then(|plan| ctx.create_physical_plan(&plan))
    .map_err(|e| to_tonic_err(&e))?;

// execute the query
let results = collect(plan.clone()).await.map_err(|e| to_tonic_err(&e))?;
```
But this PR now runs the optimizer twice if you use `sql` + `.collect`.
I am not sure what the expected end result would be. I guess one could keep
some kind of flag marking a plan node as optimized, so that when it is the
root a full optimization is not run again, but maybe that's not worth it.
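As a rough sketch of that flag idea (purely hypothetical; none of these types or fields exist in DataFusion), the root of an optimized plan could carry a marker so a second `optimize` call becomes a no-op:

```rust
// Hypothetical plan wrapper: `optimized` records that the full optimizer
// has already run on this plan, so a repeated call can short-circuit.
#[derive(Debug, Clone, PartialEq)]
struct Plan {
    description: String,
    optimized: bool,
}

fn optimize(plan: &Plan) -> Plan {
    if plan.optimized {
        // Root was already optimized: skip the full set of passes.
        return plan.clone();
    }
    // Stand-in for actually running the optimizer passes.
    Plan {
        description: format!("optimized({})", plan.description),
        optimized: true,
    }
}

fn main() {
    let plan = Plan {
        description: "projection(scan)".to_string(),
        optimized: false,
    };
    let once = optimize(&plan);
    let twice = optimize(&once);
    // The second run is a no-op thanks to the flag.
    assert_eq!(once, twice);
}
```

The downside, as noted above, is the bookkeeping: any later mutation of the plan (more grouping, repartitioning, etc.) would have to clear the flag, which may cost more in complexity than the redundant optimizer run saves.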
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]