alamb opened a new pull request #759: URL: https://github.com/apache/arrow-datafusion/pull/759
# Which issue does this PR close?

Resolves https://github.com/apache/arrow-datafusion/issues/733

# Rationale for this Change

Previously, only some logical optimizer passes (and no physical optimizer passes) were shown in `EXPLAIN VERBOSE` output. This was because each optimizer rule had to include its own special-case handling for `EXPLAIN`, so unsurprisingly some rules (especially newly written ones) did not.

# What changes are included in this PR?

1. Handle capturing logical optimizer output centrally in `ExecutionContext::optimize` (a rough sketch of the observer idea is at the end of this description)
2. Remove the old `optimize_for_explain` plumbing
3. Show plans that are no different from the previous one as "SAME TEXT AS ABOVE" (see the sketch just before the "After this change" output)
4. Capture physical optimizer output in the `PhysicalPlanner`
5. Clean up how `StringifiedPlan`s are created using traits

# Are there any user-facing changes?

Yes, the `EXPLAIN` output is different. To see the difference, run

```shell
echo "1,2" > /tmp/foo.csv
cargo run --bin datafusion-cli
```

Then run

```sql
CREATE EXTERNAL TABLE foo(c1 int, c2 int) STORED AS CSV LOCATION '/tmp/foo.csv';
EXPLAIN VERBOSE SELECT * from foo;
```

## Before this change:

Note that the optimizer passes appear duplicated in this explain because that is what actually happens: optimize is called once as part of `ExecutionContext::sql()` and again as part of `DataFrameImpl::collect()`. If we want to avoid the double optimization, I think we should treat it separately and do so in a follow-on PR. This PR faithfully captures what DataFusion is actually doing.

```
+-----------------------------------------+--------------------------------------------------------------------------+
| plan_type                               | plan                                                                     |
+-----------------------------------------+--------------------------------------------------------------------------+
| initial_logical_plan                    | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=None                                         |
| logical_plan after projection_push_down | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after simplify_expressions | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after limit_push_down      | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after projection_push_down | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after simplify_expressions | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after limit_push_down      | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan                            | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| initial_physical_plan                   | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                         |   CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false   |
| physical_plan                           | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                         |   RepartitionExec: partitioning=RoundRobinBatch(16)                      |
|                                         |     CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false |
+-----------------------------------------+--------------------------------------------------------------------------+
```
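As an aside, the "SAME TEXT AS ABOVE" collapsing from item 3 boils down to comparing each stringified plan with the most recently displayed text. A minimal sketch of that idea, using a simplified `StringifiedPlan` struct whose fields are illustrative stand-ins rather than DataFusion's actual type:

```rust
// Hedged sketch: collapse consecutive identical plan snapshots so the
// EXPLAIN table shows "SAME TEXT AS ABOVE" instead of repeating text.
#[derive(Debug, Clone)]
struct StringifiedPlan {
    plan_type: String, // e.g. "logical_plan after constant_folding"
    plan: String,      // the indented plan text for that pass
}

/// Replace the text of any snapshot identical to the last *differing* one.
fn dedup_plan_text(plans: Vec<StringifiedPlan>) -> Vec<StringifiedPlan> {
    let mut previous: Option<String> = None;
    plans
        .into_iter()
        .map(|mut p| {
            if previous.as_deref() == Some(p.plan.as_str()) {
                p.plan = "SAME TEXT AS ABOVE".to_string();
            } else {
                // Only update on a change, so a run of identical snapshots
                // keeps comparing against the last concrete text.
                previous = Some(p.plan.clone());
            }
            p
        })
        .collect()
}

fn main() {
    let plans = vec![
        StringifiedPlan { plan_type: "initial_logical_plan".into(), plan: "Projection: #foo.c1, #foo.c2".into() },
        StringifiedPlan { plan_type: "logical_plan after constant_folding".into(), plan: "Projection: #foo.c1, #foo.c2".into() },
    ];
    for p in dedup_plan_text(plans) {
        println!("{:<40} {}", p.plan_type, p.plan);
    }
}
```

Note this only rewrites the displayed text; every pass is still recorded, so a rule that does change the plan later still shows its full output, as in the table below.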
## After this change:

```
> EXPLAIN VERBOSE SELECT * from foo;
+-------------------------------------------+--------------------------------------------------------------------------+
| plan_type                                 | plan                                                                     |
+-------------------------------------------+--------------------------------------------------------------------------+
| initial_logical_plan                      | Projection: #foo.c1, #foo.c2                                             |
|                                           |   TableScan: foo projection=None                                         |
| logical_plan after constant_folding       | SAME TEXT AS ABOVE                                                       |
| logical_plan after eliminate_limit        | SAME TEXT AS ABOVE                                                       |
| logical_plan after aggregate_statistics   | SAME TEXT AS ABOVE                                                       |
| logical_plan after projection_push_down   | Projection: #foo.c1, #foo.c2                                             |
|                                           |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after filter_push_down       | SAME TEXT AS ABOVE                                                       |
| logical_plan after simplify_expressions   | SAME TEXT AS ABOVE                                                       |
| logical_plan after hash_build_probe_order | SAME TEXT AS ABOVE                                                       |
| logical_plan after limit_push_down        | SAME TEXT AS ABOVE                                                       |
| logical_plan after constant_folding       | SAME TEXT AS ABOVE                                                       |
| logical_plan after eliminate_limit        | SAME TEXT AS ABOVE                                                       |
| logical_plan after aggregate_statistics   | SAME TEXT AS ABOVE                                                       |
| logical_plan after projection_push_down   | SAME TEXT AS ABOVE                                                       |
| logical_plan after filter_push_down       | SAME TEXT AS ABOVE                                                       |
| logical_plan after simplify_expressions   | SAME TEXT AS ABOVE                                                       |
| logical_plan after hash_build_probe_order | SAME TEXT AS ABOVE                                                       |
| logical_plan after limit_push_down        | SAME TEXT AS ABOVE                                                       |
| logical_plan                              | Projection: #foo.c1, #foo.c2                                             |
|                                           |   TableScan: foo projection=Some([0, 1])                                 |
| initial_physical_plan                     | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                           |   CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false   |
| physical_plan after coalesce_batches      | SAME TEXT AS ABOVE                                                       |
| physical_plan after repartition           | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                           |   RepartitionExec: partitioning=RoundRobinBatch(16)                      |
|                                           |     CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false |
| physical_plan after add_merge_exec        | SAME TEXT AS ABOVE                                                       |
| physical_plan                             | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                           |   RepartitionExec: partitioning=RoundRobinBatch(16)                      |
|                                           |     CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false |
+-------------------------------------------+--------------------------------------------------------------------------+
```
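For reference, the centralized capture described in items 1 and 4 can be thought of as threading an observer callback through the optimizer loop, so `EXPLAIN VERBOSE` snapshots each pass without any individual rule knowing about it. A rough, self-contained sketch of that control flow; the trait and types below are simplified stand-ins, not DataFusion's real signatures:

```rust
// Simplified stand-ins for DataFusion's types; real signatures differ.
#[derive(Clone)]
struct LogicalPlan(String);

trait OptimizerRule {
    fn name(&self) -> &str;
    fn optimize(&self, plan: &LogicalPlan) -> LogicalPlan;
}

/// Run every rule, calling `observer` after each pass. EXPLAIN VERBOSE can
/// pass a closure that records one snapshot per pass; normal execution
/// passes a no-op, so rules never special-case EXPLAIN themselves.
fn optimize_with_observer<F>(
    mut plan: LogicalPlan,
    rules: &[Box<dyn OptimizerRule>],
    mut observer: F,
) -> LogicalPlan
where
    F: FnMut(&LogicalPlan, &dyn OptimizerRule),
{
    for rule in rules {
        plan = rule.optimize(&plan);
        observer(&plan, rule.as_ref());
    }
    plan
}

// Toy rule used only to exercise the loop.
struct PushDownProjection;
impl OptimizerRule for PushDownProjection {
    fn name(&self) -> &str {
        "projection_push_down"
    }
    fn optimize(&self, plan: &LogicalPlan) -> LogicalPlan {
        plan.clone()
    }
}

fn main() {
    let rules: Vec<Box<dyn OptimizerRule>> = vec![Box::new(PushDownProjection)];
    let mut snapshots = Vec::new();
    let optimized =
        optimize_with_observer(LogicalPlan("Projection".into()), &rules, |p, rule| {
            // Each snapshot becomes one "logical_plan after <rule>" row.
            snapshots.push(format!("logical_plan after {}: {}", rule.name(), p.0));
        });
    println!("{snapshots:#?}\nfinal: {}", optimized.0);
}
```

The actual change wires a similar observer through `ExecutionContext::optimize` and the physical planner, pushing a `StringifiedPlan` per pass; the sketch above only illustrates the shape of that control flow.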
