alamb opened a new issue, #19217:
URL: https://github.com/apache/datafusion/issues/19217

   ### Is your feature request related to a problem or challenge?
   
   Right now, when a table is created via `CREATE EXTERNAL TABLE`, the 
underlying `ListingTable` reads and parses the statistics for each file. If the 
same table is queried again, the cached statistics are reused. However, if the 
same files are queried again outside of that table (for example, directly by 
path), the statistics are re-read. 
   
   
   
   ```sql
   -- has to read remote storage and calculate statistics
   select count(*) from 
'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
   Elapsed 0.092 seconds.
   -- re-calculates the same statistics
   select count(*) from 
'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
   Elapsed 0.092 seconds.
   ```
   
   You can see the statistics are recalculated because both queries take the same 0.092 seconds.
   
   
   If you use `CREATE EXTERNAL TABLE`, which caches the statistics, the query 
takes only 0.029 seconds:
   
   ```
   CREATE EXTERNAL TABLE hits stored as parquet location 
'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
   > select count(*) from hits;
   +----------+
   | count(*) |
   +----------+
   | 1000000  |
   +----------+
   1 row(s) fetched.
   Elapsed 0.029 seconds.
   ``` 
   
   
   In addition, since a new cache is created for each table,  the great 
function `statistics_cache` added by @nuno-faria  in 
https://github.com/apache/datafusion/pull/19054 doesn't show anything:
   
   ```sql
   > CREATE EXTERNAL TABLE hits stored as parquet location 
'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
   0 row(s) fetched.
   Elapsed 0.405 seconds.
   
   > select * from statistics_cache();
   +------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
   | path | file_modified | file_size_bytes | e_tag | version | num_rows | num_columns | table_size_bytes | statistics_size_bytes |
   +------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
   +------+---------------+-----------------+-------+---------+----------+-------------+------------------+-----------------------+
   0 row(s) fetched.
   Elapsed 0.015 seconds.
   ```
   
   
   ### Describe the solution you'd like
   
   I would like a session-scoped `FileStatisticsCache` that is shared between 
statements / `ListingTable`s, the same way the `DefaultFilesMetadataCache` is 
created:
   
   
https://github.com/apache/datafusion/blob/57c0dda7b512154dd8ee9abe4f07d0462759d224/datafusion/execution/src/cache/cache_manager.rs#L165-L171
   
   
   
   The code for ListingTable  somewhat unobviously sets a 
DefaultFileStatisticsCache here
   
https://github.com/apache/datafusion/blob/9f725d9c7064813cda0de0f87d115354b68d76e6/datafusion/catalog-listing/src/table.rs#L260-L259
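
To illustrate the idea, here is a minimal sketch of a session-owned statistics cache that every table created from that session shares. It uses only the standard library, and every name in it (`FileStats`, `StatsCache`, `Session`, `Table`, `get_or_load`) is hypothetical. It does not reflect DataFusion's actual `FileStatisticsCache` trait or `ListingTable` API, and it ignores invalidation (file modification time, e_tag) and size limits.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for per-file statistics (not DataFusion's `Statistics`).
#[derive(Debug, Clone, PartialEq)]
pub struct FileStats {
    pub num_rows: usize,
}

/// Hypothetical cache keyed by object path.
#[derive(Debug, Default)]
pub struct StatsCache {
    entries: Mutex<HashMap<String, Arc<FileStats>>>,
}

impl StatsCache {
    /// Return the cached statistics for `path`, calling `load` only on a miss.
    /// The lock is held while `load` runs, which keeps the sketch simple but
    /// would serialize loads in a real implementation.
    pub fn get_or_load(
        &self,
        path: &str,
        load: impl FnOnce() -> FileStats,
    ) -> Arc<FileStats> {
        let mut entries = self.entries.lock().unwrap();
        if let Some(stats) = entries.get(path) {
            return Arc::clone(stats);
        }
        let stats = Arc::new(load());
        entries.insert(path.to_string(), Arc::clone(&stats));
        stats
    }

    /// Number of files with cached statistics.
    pub fn len(&self) -> usize {
        self.entries.lock().unwrap().len()
    }
}

/// Hypothetical session: owns one cache that all of its tables share.
#[derive(Default)]
pub struct Session {
    stats_cache: Arc<StatsCache>,
}

/// Hypothetical table: holds a handle to the session's cache instead of
/// creating a private one.
pub struct Table {
    path: String,
    cache: Arc<StatsCache>,
}

impl Session {
    /// Create a table over `path` that reuses the session-wide cache.
    pub fn table(&self, path: &str) -> Table {
        Table {
            path: path.to_string(),
            cache: Arc::clone(&self.stats_cache),
        }
    }

    /// Number of files whose statistics are cached in this session.
    pub fn cached_files(&self) -> usize {
        self.stats_cache.len()
    }
}

impl Table {
    /// Fetch statistics through the shared cache; `load` stands in for the
    /// expensive read from remote storage.
    pub fn statistics(&self, load: impl FnOnce() -> FileStats) -> Arc<FileStats> {
        self.cache.get_or_load(&self.path, load)
    }
}
```

With this shape, two tables (or two ad hoc queries) over the same file in one session would trigger a single statistics load, and a function like `statistics_cache()` would have one place to look for entries.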
   
   
   
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   We probably also need to add a limit, like this:
   - https://github.com/apache/datafusion/issues/19052


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

