zhuqi-lucas commented on issue #22553:
URL: https://github.com/apache/datafusion/issues/22553#issuecomment-4621851231

   > > Implementor (a moka-backed cache, the default, etc.) owns the 
singleflight + loader internally. fetch_metadata in 
datafusion/datasource-parquet/src/metadata.rs switches to calling this single 
method instead of the explicit get / is_valid_for / put sequence. 
Async-fn-in-trait is already established by infer_stats_and_ordering, so no new 
ergonomic terrain on that front.
   > 
   > Basically I think we should be aiming for:
   > 
   > 1. Not implement any more sophisticated caching in DataFusion itself (e.g. 
the thundering herd problem can be solved in downstream crates)
   > 2. Update the APIs in DataFusion to allow for that more sophisticated 
caching
   > 
   > Given the current API is sync, here are two ideas:
   > 
   > 1. Switch the cache API to be `async` (or some more explicit `Future`-based form)
   > 2. make DFParquetMetadata a trait / extendible so you can override the 
behavior of 
[`fetch_metadata`](https://github.com/apache/datafusion/blob/a7c2f7d3f844cd1ff76c8edb9d472d7979779153/datafusion/datasource-parquet/src/metadata.rs#L129-L128)
   
   
   Thanks @alamb, I will start with option 1.
   
    Option 1 (async cache API) alone is enough for our use case. The
     thundering herd problem on cold metadata fetches at higher concurrency
     is exactly what we need solved, and a sync trait can't express
     singleflight semantics. With an async API on FileMetadataCache we can
     plug in a downstream cache implementation (moka's get_with(loader)
     gives singleflight for free, or a tokio OnceCell-based one) without
     DataFusion itself having to pick a strategy.
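
A minimal sketch of what singleflight means here, using only the standard library. Everything in it is hypothetical (`SingleflightCache`, `get_or_load`, the string key and value); it is not DataFusion's `FileMetadataCache` API. To stay dependency-free the loader is a blocking closure standing in for the async metadata fetch, and `std::sync::OnceLock` stands in for the role that tokio's `OnceCell` or moka's `get_with` would play in async code. The point is that the cache, not the caller, receives the loader, which is what lets it deduplicate concurrent cold misses.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};
use std::thread;

/// Hypothetical cache: one `OnceLock` slot per key, so the loader
/// runs at most once per key even when many callers miss together.
struct SingleflightCache {
    slots: Mutex<HashMap<String, Arc<OnceLock<Arc<String>>>>>,
}

impl SingleflightCache {
    fn new() -> Self {
        Self { slots: Mutex::new(HashMap::new()) }
    }

    /// Returns the cached value for `key`, running `loader` only if
    /// no other caller has already populated (or is populating) the slot.
    fn get_or_load<F>(&self, key: &str, loader: F) -> Arc<String>
    where
        F: FnOnce() -> String,
    {
        // Hold the map lock only long enough to find or create the slot.
        let slot = {
            let mut slots = self.slots.lock().unwrap();
            slots.entry(key.to_string()).or_default().clone()
        };
        // Concurrent callers block here until the first loader finishes.
        slot.get_or_init(|| Arc::new(loader())).clone()
    }
}

/// Eight threads race on one cold key; returns how many times the
/// loader actually ran.
fn demo_loader_calls() -> usize {
    let cache = Arc::new(SingleflightCache::new());
    let calls = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let cache = Arc::clone(&cache);
            let calls = Arc::clone(&calls);
            thread::spawn(move || {
                cache.get_or_load("example.parquet", || {
                    calls.fetch_add(1, Ordering::SeqCst);
                    "footer bytes".to_string()
                })
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    calls.load(Ordering::SeqCst)
}
```

With a separate get / put pair the cache never sees the loader, so eight concurrent misses would trigger eight fetches; here `demo_loader_calls()` reports a single one.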
   
     The default in-memory impl could just unconditionally call the loader
     (no singleflight, same behavior as today). Downstream crates can swap
     in moka or OnceCell-based implementations when they need singleflight.
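
A rough sketch of what such an async, loader-taking cache method and a pass-through default could look like. The trait name `AsyncMetadataCache`, the method `get_or_load`, and the `String` payload are all assumptions made for illustration, not the signature proposed for `FileMetadataCache`. The default shown here reads the paragraph above as "on a miss, each caller runs its own loader", with no deduplication. A tiny `block_on` helper is included only so the example runs without an async runtime (it needs Rust 1.75+ for `async fn` in traits).

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, Thread};

/// Hypothetical async cache API: the implementor receives the loader
/// and decides whether to deduplicate concurrent misses.
trait AsyncMetadataCache {
    async fn get_or_load<F, Fut>(&self, key: &str, loader: F) -> Arc<String>
    where
        F: FnOnce() -> Fut,
        Fut: Future<Output = String>;
}

/// Pass-through default: plain map lookup, loader on every miss.
struct SimpleCache {
    entries: Mutex<HashMap<String, Arc<String>>>,
}

impl AsyncMetadataCache for SimpleCache {
    async fn get_or_load<F, Fut>(&self, key: &str, loader: F) -> Arc<String>
    where
        F: FnOnce() -> Fut,
        Fut: Future<Output = String>,
    {
        // The lock guard is dropped at the end of this statement,
        // so it is never held across the `.await` below.
        let cached = self.entries.lock().unwrap().get(key).cloned();
        if let Some(hit) = cached {
            return hit;
        }
        let value = Arc::new(loader().await);
        self.entries
            .lock()
            .unwrap()
            .insert(key.to_string(), Arc::clone(&value));
        value
    }
}

/// Minimal single-future executor, only for running the example.
struct ThreadWaker(Thread);

impl Wake for ThreadWaker {
    fn wake(self: Arc<Self>) {
        self.0.unpark();
    }
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(ThreadWaker(thread::current())));
    let mut cx = Context::from_waker(&waker);
    loop {
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(value) => return value,
            Poll::Pending => thread::park(),
        }
    }
}

/// Two sequential lookups of the same key; returns the loader call count.
fn demo_sequential_loads() -> usize {
    let cache = SimpleCache { entries: Mutex::new(HashMap::new()) };
    let calls = AtomicUsize::new(0);
    let calls_ref = &calls;
    for _ in 0..2 {
        block_on(cache.get_or_load("example.parquet", move || async move {
            calls_ref.fetch_add(1, Ordering::SeqCst);
            "footer bytes".to_string()
        }));
    }
    calls.load(Ordering::SeqCst)
}
```

A downstream crate could implement the same trait by delegating to a singleflight-capable cache, leaving `fetch_metadata`-style callers unchanged.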
   
     Will keep option 2 (DFParquetMetadata trait) as a future option if
     use cases that aren't expressible through the cache API show up.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

