alamb opened a new issue, #17091:
URL: https://github.com/apache/datafusion/issues/17091

   ### Is your feature request related to a problem or challenge?
   
   We are adding a Parquet metadata cache to `ListingTable` 🎉 (thanks 
@nuno-faria, @jonathanc-n, and @shehabgamin)
   
   The cache turns out to be somewhat tricky to get right, and it is not always 
clear what is going on. It is especially tricky when the metadata is sometimes 
cached with page indexes and sometimes without them; for example, see this PR:
   - https://github.com/apache/datafusion/pull/17022
   
   
   
   
   
   ### Describe the solution you'd like
   
   I would like some way to see the contents of the cache with basic statistics
   
   ### Describe alternatives you've considered
   
   I suggest a twofold approach:
   1. Add APIs to the `DefaultFileMetadataCache` itself
   2. Add a function in `datafusion-cli` that uses those APIs to show the cache 
state
   
   This two-pronged approach would:
   1. Help debug the workings of the cache with `datafusion-cli`
   2. Ensure the APIs on the cache can be used to build useful introspection 
tools
   3. Offer an example for others of how to build such a tool
   
   An example might look like:
   ```sql
   select * from 
   ```
   
   And the output might look like:
   
   | path     | e_tag | size_bytes | page_index | hits |
   |----------|-------|------------|------------|------|
   | /foo/bar |       | 1234       | t          | 12   |
   | /foo/baz | xdef  | 3781       | t          | 1    |
   ...
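
As a rough illustration of the two-pronged idea, the sketch below pairs a toy cache with an introspection method and a renderer that produces the columns shown above. Everything in it is hypothetical: `ToyMetadataCache`, `CacheEntryInfo`, `list_entries`, and `render_table` are made-up names for this example and are not the API of `DefaultFileMetadataCache` or `datafusion-cli`.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-entry statistics; the fields mirror the example columns
/// in this issue and are not an existing DataFusion type.
#[derive(Debug, Clone, PartialEq)]
pub struct CacheEntryInfo {
    pub path: String,
    pub e_tag: Option<String>,
    pub size_bytes: usize,
    pub page_index: bool,
    pub hits: usize,
}

/// Toy stand-in for a metadata cache (NOT `DefaultFileMetadataCache`).
#[derive(Default)]
pub struct ToyMetadataCache {
    entries: BTreeMap<String, CacheEntryInfo>,
}

impl ToyMetadataCache {
    /// Record (or overwrite) an entry for `path`, resetting its hit count.
    pub fn put(&mut self, path: &str, e_tag: Option<&str>, size_bytes: usize, page_index: bool) {
        let info = CacheEntryInfo {
            path: path.to_string(),
            e_tag: e_tag.map(str::to_string),
            size_bytes,
            page_index,
            hits: 0,
        };
        self.entries.insert(path.to_string(), info);
    }

    /// Look up `path`, counting a hit when the entry is present.
    pub fn get(&mut self, path: &str) -> Option<&CacheEntryInfo> {
        let entry = self.entries.get_mut(path)?;
        entry.hits += 1;
        Some(&*entry)
    }

    /// Introspection hook: a snapshot of all entries, ordered by path.
    pub fn list_entries(&self) -> Vec<CacheEntryInfo> {
        self.entries.values().cloned().collect()
    }
}

/// Render a snapshot as a Markdown table with the columns from this issue.
pub fn render_table(entries: &[CacheEntryInfo]) -> String {
    let mut out = String::from("| path | e_tag | size_bytes | page_index | hits |\n");
    out.push_str("|---|---|---|---|---|\n");
    for e in entries {
        out.push_str(&format!(
            "| {} | {} | {} | {} | {} |\n",
            e.path,
            e.e_tag.as_deref().unwrap_or(""),
            e.size_bytes,
            if e.page_index { "t" } else { "f" },
            e.hits
        ));
    }
    out
}
```

Returning an owned snapshot from `list_entries` keeps the introspection path decoupled from the cache's internal storage, which is one way a CLI table function could consume it; a real implementation would also need to decide how locking and eviction interact with such a snapshot.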
   
   I think we could model its implementation on the `parquet_metadata` 
function: 
https://datafusion.apache.org/user-guide/cli/usage.html#parquet-metadata
   
   
https://github.com/apache/datafusion/blob/173989cc2fb55c30cd174b520754812ea408e00b/datafusion-cli/src/functions.rs#L320
   
   ### Additional context
   
   _No response_



