alamb opened a new issue, #19217: URL: https://github.com/apache/datafusion/issues/19217
### Is your feature request related to a problem or challenge?

Right now, when a table is created via `CREATE EXTERNAL TABLE`, the underlying `ListingTable` reads and parses the statistics for each file. If the same table is queried again, the cached statistics are reused. However, if the same files are queried again directly by path, the statistics are re-read.

```sql
-- has to read remote storage and calculate statistics
select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
Elapsed 0.092 seconds.

-- re-calculates the same statistics
select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
Elapsed 0.092 seconds.
```

The identical 0.092 s timings show that the statistics are recalculated.

If you use `CREATE EXTERNAL TABLE`, which caches the statistics, the query takes only 0.029 seconds:

```
CREATE EXTERNAL TABLE hits
stored as parquet
location 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';

> select count(*) from hits;
+----------+
| count(*) |
+----------+
| 1000000  |
+----------+
1 row(s) fetched.
Elapsed 0.029 seconds.
```

In addition, since a new cache is created for each table, the great function `statistics_cache` added by @nuno-faria in https://github.com/apache/datafusion/pull/19054 doesn't show anything:

```sql
> CREATE EXTERNAL TABLE hits
stored as parquet
location 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
0 row(s) fetched.
Elapsed 0.405 seconds.

> select * from statistics_cache();
+------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
| path | file_modified | file_size_bytes | e_tag | version | num_rows | num_columns | table_size_bytes | statistics_size_bytes |
+------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
+------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
0 row(s) fetched.
Elapsed 0.015 seconds.
```

### Describe the solution you'd like

I would like a session-scoped `FileStatisticsCache` that is shared between statements / `ListingTable`s, the same way the `DefaultFilesMetadataCache` is created:
https://github.com/apache/datafusion/blob/57c0dda7b512154dd8ee9abe4f07d0462759d224/datafusion/execution/src/cache/cache_manager.rs#L165-L171

The code for `ListingTable` somewhat unobviously sets a `DefaultFileStatisticsCache` here:
https://github.com/apache/datafusion/blob/9f725d9c7064813cda0de0f87d115354b68d76e6/datafusion/catalog-listing/src/table.rs#L260-L259

### Describe alternatives you've considered

_No response_

### Additional context

We probably also need to add a limit, like this:
- https://github.com/apache/datafusion/issues/19052
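To make the request under "Describe the solution you'd like" concrete, here is a minimal sketch of a statistics cache that is owned by the session and shared by every table, rather than created per table. It uses only the Rust standard library. All names in it (`SessionStatsCache`, `FileMeta`, `FileStats`, `Table`) are hypothetical stand-ins for illustration and are not DataFusion's actual types or API; the validation rule (compare file size and modification time) is also an assumption.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical file metadata used to decide whether a cached entry is still valid.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct FileMeta {
    pub size_bytes: u64,
    pub last_modified: u64, // e.g. seconds since the epoch
}

/// Hypothetical stand-in for per-file statistics.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct FileStats {
    pub num_rows: u64,
}

/// One cache per session, shared by every table and statement in that session.
#[derive(Default)]
pub struct SessionStatsCache {
    entries: Mutex<HashMap<String, (FileMeta, Arc<FileStats>)>>,
}

impl SessionStatsCache {
    /// Return the cached statistics if the file is unchanged;
    /// otherwise run `compute` and store the result.
    /// Note: this sketch holds the lock while computing, which a real
    /// implementation doing remote I/O would likely want to avoid.
    pub fn get_or_compute<F>(&self, path: &str, meta: &FileMeta, compute: F) -> Arc<FileStats>
    where
        F: FnOnce() -> FileStats,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some((cached_meta, stats)) = entries.get(path) {
            if cached_meta == meta {
                return Arc::clone(stats);
            }
        }
        let stats = Arc::new(compute());
        entries.insert(path.to_string(), (meta.clone(), Arc::clone(&stats)));
        stats
    }

    /// Number of files that currently have cached statistics.
    pub fn len(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}

/// Hypothetical table that borrows the session's cache instead of owning its own.
pub struct Table {
    path: String,
    meta: FileMeta,
    cache: Arc<SessionStatsCache>,
}

impl Table {
    pub fn new(path: &str, meta: FileMeta, cache: Arc<SessionStatsCache>) -> Self {
        Table { path: path.to_string(), meta, cache }
    }

    /// Look up statistics through the shared cache; `compute` runs only on a miss.
    pub fn statistics(&self, compute: impl FnOnce() -> FileStats) -> Arc<FileStats> {
        self.cache.get_or_compute(&self.path, &self.meta, compute)
    }
}
```

With this shape, two tables (or two ad hoc queries) over the same unchanged file trigger only one statistics computation, and a session-level listing function could enumerate the single shared map. A size limit like the one mentioned under "Additional context" is left out of the sketch.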
