alamb opened a new pull request #759: URL: https://github.com/apache/arrow-datafusion/pull/759
# Which issue does this PR close?

Resolves https://github.com/apache/arrow-datafusion/issues/733

# Rationale for this Change

Previously, only some logical optimizer passes (and no physical optimizer passes) were shown in `EXPLAIN VERBOSE` output. This was because each optimizer rule had to include its own special-case handling for `EXPLAIN`, so unsurprisingly some rules (especially newly written ones) did not.

# What changes are included in this PR?

1. Handle capturing logical optimizer output centrally in `ExecutionContext::optimize` (a rough sketch of the observer idea is at the end of this description)
2. Remove the old `optimize_for_explain` plumbing
3. Show plans that are no different from the previous one as "SAME TEXT AS ABOVE" (see the sketch just before the "After this change" output)
4. Capture physical optimizer output in the `PhysicalPlanner`
5. Clean up how `StringifiedPlan`s are created using traits

# Are there any user-facing changes?

Yes, the `EXPLAIN` output is different. To see the difference, run

```shell
echo "1,2" > /tmp/foo.csv
cargo run --bin datafusion-cli
```

Then run

```sql
CREATE EXTERNAL TABLE foo(c1 int, c2 int) STORED AS CSV LOCATION '/tmp/foo.csv';
EXPLAIN VERBOSE SELECT * from foo;
```

## Before this change:

Note that the optimizer passes appear duplicated in this explain because that is what actually happens: optimize is called once as part of `ExecutionContext::sql()` and again as part of `DataFrameImpl::collect()`. If we want to avoid the double optimization, I think we should treat it separately and do so in a follow-on PR. This PR faithfully captures what DataFusion is actually doing.

```
+-----------------------------------------+--------------------------------------------------------------------------+
| plan_type                               | plan                                                                     |
+-----------------------------------------+--------------------------------------------------------------------------+
| initial_logical_plan                    | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=None                                         |
| logical_plan after projection_push_down | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after simplify_expressions | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after limit_push_down      | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after projection_push_down | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after simplify_expressions | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after limit_push_down      | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan                            | Projection: #foo.c1, #foo.c2                                             |
|                                         |   TableScan: foo projection=Some([0, 1])                                 |
| initial_physical_plan                   | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                         |   CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false   |
| physical_plan                           | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                         |   RepartitionExec: partitioning=RoundRobinBatch(16)                      |
|                                         |     CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false |
+-----------------------------------------+--------------------------------------------------------------------------+
```
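As an aside, the "SAME TEXT AS ABOVE" collapsing from item 3 boils down to comparing each stringified plan with the most recently displayed text. A minimal sketch of that idea, using a simplified `StringifiedPlan` struct whose fields are illustrative stand-ins rather than DataFusion's actual type:

```rust
// Hedged sketch: collapse consecutive identical plan snapshots so the
// EXPLAIN table shows "SAME TEXT AS ABOVE" instead of repeating text.
#[derive(Debug, Clone)]
struct StringifiedPlan {
    plan_type: String, // e.g. "logical_plan after constant_folding"
    plan: String,      // the indented plan text for that pass
}

/// Replace the text of any snapshot identical to the last *differing* one.
fn dedup_plan_text(plans: Vec<StringifiedPlan>) -> Vec<StringifiedPlan> {
    let mut previous: Option<String> = None;
    plans
        .into_iter()
        .map(|mut p| {
            if previous.as_deref() == Some(p.plan.as_str()) {
                p.plan = "SAME TEXT AS ABOVE".to_string();
            } else {
                // Only update on a change, so a run of identical snapshots
                // keeps comparing against the last concrete text.
                previous = Some(p.plan.clone());
            }
            p
        })
        .collect()
}

fn main() {
    let plans = vec![
        StringifiedPlan { plan_type: "initial_logical_plan".into(), plan: "Projection: #foo.c1, #foo.c2".into() },
        StringifiedPlan { plan_type: "logical_plan after constant_folding".into(), plan: "Projection: #foo.c1, #foo.c2".into() },
    ];
    for p in dedup_plan_text(plans) {
        println!("{:<40} {}", p.plan_type, p.plan);
    }
}
```

Note this only rewrites the displayed text; every pass is still recorded, so a rule that does change the plan later still shows its full output, as in the table below.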
## After this change:

```
> EXPLAIN VERBOSE SELECT * from foo;
+-------------------------------------------+--------------------------------------------------------------------------+
| plan_type                                 | plan                                                                     |
+-------------------------------------------+--------------------------------------------------------------------------+
| initial_logical_plan                      | Projection: #foo.c1, #foo.c2                                             |
|                                           |   TableScan: foo projection=None                                         |
| logical_plan after constant_folding       | SAME TEXT AS ABOVE                                                       |
| logical_plan after eliminate_limit        | SAME TEXT AS ABOVE                                                       |
| logical_plan after aggregate_statistics   | SAME TEXT AS ABOVE                                                       |
| logical_plan after projection_push_down   | Projection: #foo.c1, #foo.c2                                             |
|                                           |   TableScan: foo projection=Some([0, 1])                                 |
| logical_plan after filter_push_down       | SAME TEXT AS ABOVE                                                       |
| logical_plan after simplify_expressions   | SAME TEXT AS ABOVE                                                       |
| logical_plan after hash_build_probe_order | SAME TEXT AS ABOVE                                                       |
| logical_plan after limit_push_down        | SAME TEXT AS ABOVE                                                       |
| logical_plan after constant_folding       | SAME TEXT AS ABOVE                                                       |
| logical_plan after eliminate_limit        | SAME TEXT AS ABOVE                                                       |
| logical_plan after aggregate_statistics   | SAME TEXT AS ABOVE                                                       |
| logical_plan after projection_push_down   | SAME TEXT AS ABOVE                                                       |
| logical_plan after filter_push_down       | SAME TEXT AS ABOVE                                                       |
| logical_plan after simplify_expressions   | SAME TEXT AS ABOVE                                                       |
| logical_plan after hash_build_probe_order | SAME TEXT AS ABOVE                                                       |
| logical_plan after limit_push_down        | SAME TEXT AS ABOVE                                                       |
| logical_plan                              | Projection: #foo.c1, #foo.c2                                             |
|                                           |   TableScan: foo projection=Some([0, 1])                                 |
| initial_physical_plan                     | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                           |   CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false   |
| physical_plan after coalesce_batches      | SAME TEXT AS ABOVE                                                       |
| physical_plan after repartition           | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                           |   RepartitionExec: partitioning=RoundRobinBatch(16)                      |
|                                           |     CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false |
| physical_plan after add_merge_exec        | SAME TEXT AS ABOVE                                                       |
| physical_plan                             | ProjectionExec: expr=[c1@0 as c1, c2@1 as c2]                            |
|                                           |   RepartitionExec: partitioning=RoundRobinBatch(16)                      |
|                                           |     CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false |
+-------------------------------------------+--------------------------------------------------------------------------+
```
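For reference, the centralized capture described in items 1 and 4 can be thought of as threading an observer callback through the optimizer loop, so `EXPLAIN VERBOSE` snapshots each pass without any individual rule knowing about it. A rough, self-contained sketch of that control flow; the trait and types below are simplified stand-ins, not DataFusion's real signatures:

```rust
// Simplified stand-ins for DataFusion's types; real signatures differ.
#[derive(Clone)]
struct LogicalPlan(String);

trait OptimizerRule {
    fn name(&self) -> &str;
    fn optimize(&self, plan: &LogicalPlan) -> LogicalPlan;
}

/// Run every rule, calling `observer` after each pass. EXPLAIN VERBOSE can
/// pass a closure that records one snapshot per pass; normal execution
/// passes a no-op, so rules never special-case EXPLAIN themselves.
fn optimize_with_observer<F>(
    mut plan: LogicalPlan,
    rules: &[Box<dyn OptimizerRule>],
    mut observer: F,
) -> LogicalPlan
where
    F: FnMut(&LogicalPlan, &dyn OptimizerRule),
{
    for rule in rules {
        plan = rule.optimize(&plan);
        observer(&plan, rule.as_ref());
    }
    plan
}

// Toy rule used only to exercise the loop.
struct PushDownProjection;
impl OptimizerRule for PushDownProjection {
    fn name(&self) -> &str {
        "projection_push_down"
    }
    fn optimize(&self, plan: &LogicalPlan) -> LogicalPlan {
        plan.clone()
    }
}

fn main() {
    let rules: Vec<Box<dyn OptimizerRule>> = vec![Box::new(PushDownProjection)];
    let mut snapshots = Vec::new();
    let optimized =
        optimize_with_observer(LogicalPlan("Projection".into()), &rules, |p, rule| {
            // Each snapshot becomes one "logical_plan after <rule>" row.
            snapshots.push(format!("logical_plan after {}: {}", rule.name(), p.0));
        });
    println!("{snapshots:#?}\nfinal: {}", optimized.0);
}
```

The actual change wires a similar observer through `ExecutionContext::optimize` and the physical planner, pushing a `StringifiedPlan` per pass; the sketch above only illustrates the shape of that control flow.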
