alamb opened a new issue, #18952:
URL: https://github.com/apache/datafusion/issues/18952

   ### Is your feature request related to a problem or challenge?
   
   According to https://datafusion.apache.org/user-guide/configs.html
   
   The `datafusion.execution.collect_statistics` defaults to `true` which means 
the statistics should be gathered as part of the `CREATE TABLE` (and subsequent 
queries should not have to recreate `Statistics` for individual files)
   
   datafusion.execution.collect_statistics | true | Should DataFusion collect 
statistics when first creating a table. Has no effect after the table is 
created. Applies to the defaultĀ ListingTableProviderĀ in DataFusion. Defaults to 
true.
   -- | -- | --
   
   This behavior was introduced in the following issue for precisely this reason
   - https://github.com/apache/datafusion/issues/16158
   
   
   However, while reviewing ClickBench performance results, it seems to me that 
the statistics are actually computed on first **use**:
   -  see 
https://github.com/apache/datafusion/issues/18909#issuecomment-3577810018 for 
more details
   
   This means all our ClickBench results include the time to gather / convert 
statistics for 100 files. Other systems such as DuckDB do not pay this penalty 
(they gather the statistics on creation) and thus appear much faster in the 
leaderboard: https://benchmark.clickhouse.com
   
   
   You can reproduce this like
   
   Get data:
   ```shell
   cd ~/Software/datafusion
    cd benchmarks/
   # get data
   ./bench.sh data clickbench_partitioned
   ```
   
   Then run the queries:
   
   ```sql
   -- Create the external table -- this SHOULD gather statistics
   CREATE EXTERNAL TABLE hits
   STORED AS PARQUET
   LOCATION 'data/hits_partitioned'
   OPTIONS ('binary_as_string' 'true');
   
   -- This query should be really fast (1ms) as it can be run directly from 
statistics
   -- however, it takes 17ms-18ms on my machine, and 32ms on c6x.* class 
machines in EC2
   SELECT COUNT(*) FROM hits;
   -- This query should take about the same time as the one above
   SELECT COUNT(*) FROM hits;
   ```
   
   ```shell
   (venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion/benchmarks$ 
datafusion-cli -f q.sql
   DataFusion CLI v51.0.0
   0 row(s) fetched.
   Elapsed 0.021 seconds.
   
   +----------+
   | count(*) |
   +----------+
   | 99997497 |
   +----------+
   1 row(s) fetched.
   Elapsed 0.017 seconds.
   
   +----------+
   | count(*) |
   +----------+
   | 99997497 |
   +----------+
   1 row(s) fetched.
   Elapsed 0.001 seconds.
   ```
   
   
   
   ### Describe the solution you'd like
   
   I would like the statistics to *actually* be gathered on `CREATE TABLE`
   
    
   
   
   
   ### Describe alternatives you've considered
   
   It  appear that the statistics are cached on `DefaultFileStatisticsCache`, 
which created when the ListingTable and instantiated (but not populated) when 
the table is created the first time: 
https://github.com/apache/datafusion/blob/3c21b546a9acf9922229220d3ceca91a945cbf46/datafusion/catalog-listing/src/table.rs#L227-L226
   
   I think we could maybe "prewarm" the statistics cache here somehow:
   
https://github.com/apache/datafusion/blob/4ea76017193ca7ddac525b068d52cec1b3fcfeb0/datafusion/core/src/datasource/listing_table_factory.rs#L187-L188
   
   
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to