Ted-Jiang opened a new issue, #7556:
URL: https://github.com/apache/arrow-datafusion/issues/7556
### Is your feature request related to a problem or challenge?
In our system we pass logical plans to DataFusion with statistics collection
enabled. The source tables live on remote storage, and reading the Parquet
metadata to collect statistics sometimes takes a few seconds.
From the log:
```text
datafusion::datasource::listing::table: Not hit cache infer_stats
ObjectMeta { location: Path { raw:
"working-dir/..-examples-test_case_data-reusemeta-metadata/ddltest/parquet/17d6373b-57ef-f370-34a6-1bd37d156a76/fa2ccb1e-5470-88a2-2fcb-b19779597e96/1/part-00000-f0bfae88-e929-4af1-99be-2599f2b51b3c-c000.snappy.parquet"
}, last_modified: 2023-05-18T09:53:04.716427232Z, e_tag: None }, cost 1.5161s
```
So I checked the code and found a cache called `StatisticsCache`, constructed
here:
https://github.com/apache/arrow-datafusion/blob/abea8938b571a4aecddc7185b3acacadcc7dd854/datafusion/core/src/datasource/listing/table.rs#L656
It seems that every time a plan is built, a new empty cache is created, so the
cache only helps when statistics for the same file are inferred more than once
within the same plan.
So I want to share the statistics cache at the session level 😄 to avoid the
unstable latency of fetching remote file statistics. I think many other query
engines do this too.
### Describe the solution you'd like
Add a cache manager that owns all caches for the lifetime of the session.
https://github.com/apache/arrow-datafusion/blob/a38480951f40abce7ee2d5919251a1d1607f1dee/datafusion/execution/src/runtime_env.rs#L44-L50
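To make the idea concrete, here is a minimal sketch of what a session-level statistics cache could look like. It uses only the standard library, and every name in it (`FileMeta`, `FileStatistics`, `FileStatisticsCache`, `CacheManager`, `stats_for`) is a hypothetical stand-in for illustration, not an existing DataFusion API. It assumes an entry is valid only while the file's size and last-modified time are unchanged.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for the object metadata used to validate an entry.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct FileMeta {
    pub path: String,
    pub size: u64,
    pub last_modified_nanos: i64,
}

/// Hypothetical stand-in for the statistics inferred from one file.
#[derive(Clone, Debug, PartialEq)]
pub struct FileStatistics {
    pub num_rows: Option<usize>,
    pub total_byte_size: Option<usize>,
}

/// Cache keyed by file path. Each entry remembers the metadata it was
/// computed for, so a modified file is treated as a miss.
#[derive(Default)]
pub struct FileStatisticsCache {
    entries: RwLock<HashMap<String, (FileMeta, Arc<FileStatistics>)>>,
}

impl FileStatisticsCache {
    /// Returns cached statistics only if the stored metadata still matches.
    pub fn get(&self, meta: &FileMeta) -> Option<Arc<FileStatistics>> {
        let entries = self.entries.read().unwrap();
        entries
            .get(&meta.path)
            .filter(|(cached, _)| cached == meta)
            .map(|(_, stats)| Arc::clone(stats))
    }

    /// Stores statistics for a file, replacing any older entry for the path.
    pub fn put(&self, meta: FileMeta, stats: FileStatistics) -> Arc<FileStatistics> {
        let stats = Arc::new(stats);
        let mut entries = self.entries.write().unwrap();
        entries.insert(meta.path.clone(), (meta, Arc::clone(&stats)));
        stats
    }
}

/// Hypothetical manager that would live next to the other session-wide
/// resources. Cloning it shares the same underlying cache.
#[derive(Clone, Default)]
pub struct CacheManager {
    pub file_statistics: Arc<FileStatisticsCache>,
}

/// Looks up statistics and runs the (possibly slow) `infer` closure only on
/// a miss. Concurrent misses for the same file may both run `infer`.
pub fn stats_for(
    cache: &FileStatisticsCache,
    meta: &FileMeta,
    infer: impl FnOnce() -> FileStatistics,
) -> Arc<FileStatistics> {
    if let Some(hit) = cache.get(meta) {
        return hit;
    }
    cache.put(meta.clone(), infer())
}
```

With something like this, each plan would receive a clone of the session's `CacheManager` instead of building a fresh empty cache, so the second plan that touches the same unchanged file skips the remote metadata read. Eviction and memory limits are left out of the sketch and would need to be decided separately.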
### Describe alternatives you've considered
_No response_
### Additional context
_No response_