alamb opened a new issue, #18952: URL: https://github.com/apache/datafusion/issues/18952
### Is your feature request related to a problem or challenge? According to https://datafusion.apache.org/user-guide/configs.html The `datafusion.execution.collect_statistics` defaults to `true` which means the statistics should be gathered as part of the `CREATE TABLE` (and subsequent queries should not have to recreate `Statistics` for individual files) datafusion.execution.collect_statistics | true | Should DataFusion collect statistics when first creating a table. Has no effect after the table is created. Applies to the defaultĀ ListingTableProviderĀ in DataFusion. Defaults to true. -- | -- | -- This behavior was introduced in the following issue for precisely this reason - https://github.com/apache/datafusion/issues/16158 However, while reviewing ClickBench performance results, it seems to me that the statistics are actually computed on first **use**: - see https://github.com/apache/datafusion/issues/18909#issuecomment-3577810018 for more details This means all our ClickBench results include the time to gather / convert statistics for 100 files. Other systems such as DuckDB do not pay this penalty (they gather the statistics on creation) and thus appear much faster in the leaderboard: https://benchmark.clickhouse.com You can reproduce this like Get data: ```shell cd ~/Software/datafusion cd benchmarks/ # get data ./bench.sh data clickbench_partitioned ``` Then run the queries: ```sql -- Create the external table -- this SHOULD gather statistics CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'data/hits_partitioned' OPTIONS ('binary_as_string' 'true'); -- This query should be really fast (1ms) as it can be run directly from statistics -- however, it takes 17ms-18ms on my machine, and 32ms on c6x.* class machines in EC2 SELECT COUNT(*) FROM hits; -- This query should take about the same time as the one above SELECT COUNT(*) FROM hits; ``` ```shell (venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion/benchmarks$ datafusion-cli -f q.sql DataFusion CLI v51.0.0 0 row(s) fetched. Elapsed 0.021 seconds. +----------+ | count(*) | +----------+ | 99997497 | +----------+ 1 row(s) fetched. Elapsed 0.017 seconds. +----------+ | count(*) | +----------+ | 99997497 | +----------+ 1 row(s) fetched. Elapsed 0.001 seconds. ``` ### Describe the solution you'd like I would like the statistics to *actually* be gathered on `CREATE TABLE` ### Describe alternatives you've considered It appear that the statistics are cached on `DefaultFileStatisticsCache`, which created when the ListingTable and instantiated (but not populated) when the table is created the first time: https://github.com/apache/datafusion/blob/3c21b546a9acf9922229220d3ceca91a945cbf46/datafusion/catalog-listing/src/table.rs#L227-L226 I think we could maybe "prewarm" the statistics cache here somehow: https://github.com/apache/datafusion/blob/4ea76017193ca7ddac525b068d52cec1b3fcfeb0/datafusion/core/src/datasource/listing_table_factory.rs#L187-L188 ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
