alamb opened a new pull request #858: URL: https://github.com/apache/arrow-datafusion/pull/858
# Which issue does this PR close?

Resolves https://github.com/apache/arrow-datafusion/issues/779

# Rationale for this change

`EXPLAIN PLAN` is great for understanding what DataFusion plans to do, but today it is hard, from either the SQL or DataFrame interface, to understand in more depth what *actually happened* during execution. My real use case is being able to see how many rows flowed through each operator, as well as the "time spent" and "rows produced" by each operator.

# What changes are included in this PR?

1. Add basic plan nodes for `EXPLAIN ANALYZE` and `EXPLAIN ANALYZE VERBOSE` (SQL examples below)
2. Refactor the special-case `ParquetStream` into `RecordBatchReceiverStream` for reuse

# Are there any user-facing changes?

Yes, `EXPLAIN ANALYZE` now does something different than `EXPLAIN`.

## Example of use

Create a small CSV file:

```shell
echo "1,A" > /tmp/foo.csv
echo "1,B" >> /tmp/foo.csv
echo "2,A" >> /tmp/foo.csv
```

Run the CLI:

```shell
cargo run --bin datafusion-cli
```

Register the file as an external table:

```sql
CREATE EXTERNAL TABLE foo(x INT, b VARCHAR) STORED AS CSV LOCATION '/tmp/foo.csv';
```

## Example `EXPLAIN ANALYZE` output

```sql
> EXPLAIN ANALYZE SELECT SUM(x) FROM foo GROUP BY b;
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                     |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CoalescePartitionsExec, metrics=[]                                                                                                       |
|                   |   ProjectionExec: expr=[SUM(foo.x)@1 as SUM(x)], metrics=[]                                                                              |
|                   |     HashAggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[SUM(x)], metrics=[outputRows=2]                                      |
|                   |       CoalesceBatchesExec: target_batch_size=4096, metrics=[]                                                                            |
|                   |         RepartitionExec: partitioning=Hash([Column { name: "b", index: 0 }], 16), metrics=[sendTime=839560, fetchTime=122528525, repartitionTime=5327877] |
|                   |           HashAggregateExec: mode=Partial, gby=[b@1 as b], aggr=[SUM(x)], metrics=[outputRows=2]                                         |
|                   |             RepartitionExec: partitioning=RoundRobinBatch(16), metrics=[fetchTime=5660489, repartitionTime=0, sendTime=8012]             |
|                   |               CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false, metrics=[]                                           |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set. Query took 0.012 seconds.
```

## Example `EXPLAIN ANALYZE VERBOSE` output

```sql
> EXPLAIN ANALYZE VERBOSE SELECT SUM(x) FROM foo GROUP BY b;
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type         | plan                                                                                                                                     |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | CoalescePartitionsExec, metrics=[]                                                                                                       |
|                   |   ProjectionExec: expr=[SUM(foo.x)@1 as SUM(x)], metrics=[]                                                                              |
|                   |     HashAggregateExec: mode=FinalPartitioned, gby=[b@0 as b], aggr=[SUM(x)], metrics=[outputRows=2]                                      |
|                   |       CoalesceBatchesExec: target_batch_size=4096, metrics=[]                                                                            |
|                   |         RepartitionExec: partitioning=Hash([Column { name: "b", index: 0 }], 16), metrics=[repartitionTime=6584110, fetchTime=132927514, sendTime=904001] |
|                   |           HashAggregateExec: mode=Partial, gby=[b@1 as b], aggr=[SUM(x)], metrics=[outputRows=2]                                         |
|                   |             RepartitionExec: partitioning=RoundRobinBatch(16), metrics=[repartitionTime=0, sendTime=8246, fetchTime=6239096]             |
|                   |               CsvExec: source=Path(/tmp/foo.csv: [/tmp/foo.csv]), has_header=false, metrics=[]                                           |
| Output Rows       | 2                                                                                                                                        |
| Duration          | 10.283764ms                                                                                                                              |
+-------------------+------------------------------------------------------------------------------------------------------------------------------------------+
3 rows in set. Query took 0.014 seconds.
```

# Future work

Note this PR is designed just to hook up / plumb the existing code and metrics into SQL (basically what was added in https://github.com/apache/arrow-datafusion/pull/662). I plan a sequence of follow-on PRs to both improve the metrics infrastructure (https://github.com/apache/arrow-datafusion/issues/679) and add/fix the metrics that are actually reported so they are consistent. The specific metrics that are displayed are verbose and somewhat ad hoc at the moment.
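As an aside, the `metrics=[...]` annotations in the output above are plain `name=value` pairs, so tooling can scrape them from the rendered plan text. Below is a minimal, self-contained sketch of such a scraper; it is *not* part of this PR or of DataFusion's API, and `parse_metrics` is an illustrative name, assuming the `metrics=[k=v, ...]` format shown in the examples:

```rust
/// Hypothetical helper (not in this PR): extract the `metrics=[...]`
/// annotation from one line of an `EXPLAIN ANALYZE` plan, returning
/// the metric names and their integer values.
fn parse_metrics(line: &str) -> Vec<(String, u64)> {
    // Locate the span between "metrics=[" and the closing ']'.
    let start = match line.find("metrics=[") {
        Some(i) => i + "metrics=[".len(),
        None => return Vec::new(),
    };
    let end = match line[start..].find(']') {
        Some(i) => start + i,
        None => return Vec::new(),
    };
    line[start..end]
        .split(',')
        .filter_map(|kv| {
            // Each entry looks like "fetchTime=5660489".
            let mut parts = kv.trim().splitn(2, '=');
            let name = parts.next()?.trim().to_string();
            let value = parts.next()?.trim().parse().ok()?;
            Some((name, value))
        })
        .collect()
}

fn main() {
    let line = "RepartitionExec: partitioning=RoundRobinBatch(16), \
                metrics=[fetchTime=5660489, repartitionTime=0, sendTime=8012]";
    for (name, value) in parse_metrics(line) {
        println!("{} = {}", name, value);
    }
}
```

An empty annotation such as `metrics=[]` simply yields an empty vector, so the same loop works over every line of the plan.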
