Ted-Jiang opened a new issue, #7556:
URL: https://github.com/apache/arrow-datafusion/issues/7556

   ### Is your feature request related to a problem or challenge?
   
   In our system we pass logical plans to DataFusion with statistics collection 
enabled. The source tables live in remote storage, and reading the Parquet 
metadata to collect statistics sometimes takes a few seconds.
   From the log:
   ```text
    datafusion::datasource::listing::table: Not hit cache infer_stats ObjectMeta { location: Path { raw: "working-dir/..-examples-test_case_data-reusemeta-metadata/ddltest/parquet/17d6373b-57ef-f370-34a6-1bd37d156a76/fa2ccb1e-5470-88a2-2fcb-b19779597e96/1/part-00000-f0bfae88-e929-4af1-99be-2599f2b51b3c-c000.snappy.parquet" }, last_modified: 2023-05-18T09:53:04.716427232Z, e_tag: None }, cost 1.5161s
   ```
   
   So I checked the code and found a cache called `StatisticsCache`, constructed 
here:
   
https://github.com/apache/arrow-datafusion/blob/abea8938b571a4aecddc7185b3acacadcc7dd854/datafusion/core/src/datasource/listing/table.rs#L656
   It seems that every time a plan is built, a new empty cache is created, so 
only repeated statistics inference for the same file within the same plan 
benefits from it.
   
   So I want to share the statistics cache at the session level 😄 to avoid the 
unstable latency of fetching remote file statistics. I think many other query 
engines do this too.
   
   ### Describe the solution you'd like
   
   Add a cache manager that manages all caches for the lifetime of the session.
   
https://github.com/apache/arrow-datafusion/blob/a38480951f40abce7ee2d5919251a1d1607f1dee/datafusion/execution/src/runtime_env.rs#L44-L50
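
To make the idea concrete, here is a minimal, std-only sketch of what a session-level statistics cache could look like. It is not DataFusion's API: `SessionStatisticsCache`, `FileVersion`, `FileStats`, and `get_or_infer` are hypothetical names made up for illustration, and `FileVersion` only stands in for the fields of `ObjectMeta` (size and last-modified time) that could be used to detect a changed file. An entry is reused only when the stored version matches, so a rewritten file triggers a fresh inference.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the parts of `ObjectMeta` that identify a file version.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct FileVersion {
    pub size: u64,
    /// Last-modified time as nanoseconds since the Unix epoch.
    pub last_modified_nanos: i128,
}

/// Hypothetical stand-in for per-file statistics.
#[derive(Clone, Debug, PartialEq)]
pub struct FileStats {
    pub num_rows: Option<usize>,
    pub total_byte_size: Option<usize>,
}

/// Cache meant to live as long as the session, shared by every plan built in it.
#[derive(Default)]
pub struct SessionStatisticsCache {
    entries: Mutex<HashMap<String, (FileVersion, Arc<FileStats>)>>,
}

impl SessionStatisticsCache {
    /// Returns cached statistics only if the file version is unchanged.
    pub fn get(&self, path: &str, version: &FileVersion) -> Option<Arc<FileStats>> {
        let entries = self.entries.lock().unwrap();
        match entries.get(path) {
            Some((cached, stats)) if cached == version => Some(Arc::clone(stats)),
            _ => None,
        }
    }

    /// Stores statistics for a path, replacing any entry for an older version.
    pub fn put(&self, path: &str, version: FileVersion, stats: FileStats) -> Arc<FileStats> {
        let stats = Arc::new(stats);
        self.entries
            .lock()
            .unwrap()
            .insert(path.to_string(), (version, Arc::clone(&stats)));
        stats
    }

    /// Looks up the cache and runs `infer` (the expensive remote read) only on a miss.
    /// The lock is not held while `infer` runs, so two concurrent misses for the
    /// same file may both infer; the later `put` simply overwrites the earlier one.
    pub fn get_or_infer<F>(&self, path: &str, version: &FileVersion, infer: F) -> Arc<FileStats>
    where
        F: FnOnce() -> FileStats,
    {
        if let Some(hit) = self.get(path, version) {
            return hit;
        }
        let stats = infer();
        self.put(path, version.clone(), stats)
    }
}
```

In this sketch the cache would be created once per session (for example, held behind an `Arc` next to the runtime environment) and passed to each table scan, instead of constructing an empty cache per plan.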
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_

