zhuqi-lucas commented on issue #22553: URL: https://github.com/apache/datafusion/issues/22553#issuecomment-4621851231
> > Implementor (a moka-backed cache, the default, etc.) owns the singleflight + loader internally. fetch_metadata in datafusion/datasource-parquet/src/metadata.rs switches to calling this single method instead of the explicit get / is_valid_for / put sequence. Async-fn-in-trait is already established by infer_stats_and_ordering, so no new ergonomic terrain on that front.
>
> Basically I think we should be aiming for:
>
> 1. Not implement any more sophisticated caching in DataFusion itself (e.g. the thundering herd problem can be solved in downstream crates)
> 2. Update the APIs in DataFusion to allow for that more sophisticated caching
>
> Given the current API is sync, here are two ideas:
>
> 1. Switch the cache API to be `async` (or some more explicit Future based)
> 2. make DFParquetMetadata a trait / extendible so you can override the behavior of [`fetch_metadata`](https://github.com/apache/datafusion/blob/a7c2f7d3f844cd1ff76c8edb9d472d7979779153/datafusion/datasource-parquet/src/metadata.rs#L129-L128)

Thanks @alamb, I will pick up option 1 to start.

Option 1 (async cache API) alone is enough for our use case. The thundering herd problem on cold metadata fetches at higher concurrency is exactly what we need solved, and a sync trait can't express singleflight semantics. With an async API on FileMetadataCache we can plug in a downstream cache implementation (moka's get_with(loader) gives singleflight for free, or a tokio OnceCell-based one) without DataFusion itself having to pick a strategy.

The default in-memory impl could just unconditionally call the loader (no singleflight, same behavior as today). Downstream crates can swap in moka or OnceCell-based implementations when they need singleflight.

I will keep option 2 (DFParquetMetadata trait) as a future option if use cases that aren't expressible through the cache API show up.
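
For illustration only, the shape of option 1 might look roughly like the sketch below. Everything in it is an assumption made for the example: `AsyncMetadataCache`, `get_or_load`, `SimpleCache`, and `Metadata` are hypothetical names, not the existing `FileMetadataCache` API, and `block_on` is a toy executor included only so the snippet runs with the standard library alone. The default impl here checks the map and calls the loader on a miss with no coordination between concurrent callers, which is one reading of "no singleflight, same behavior as today".

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for parsed Parquet footer metadata.
#[derive(Debug, PartialEq)]
struct Metadata {
    num_rows: u64,
}

/// Hypothetical async cache trait: the implementor owns the miss/load policy.
trait AsyncMetadataCache {
    async fn get_or_load<F, Fut>(&self, path: &str, loader: F) -> Result<Arc<Metadata>, String>
    where
        F: FnOnce() -> Fut,
        Fut: Future<Output = Result<Arc<Metadata>, String>>;
}

/// Assumed default impl: a plain map, no singleflight. Two concurrent misses
/// for the same path would both run their loaders.
#[derive(Default)]
struct SimpleCache {
    entries: Mutex<HashMap<String, Arc<Metadata>>>,
}

impl AsyncMetadataCache for SimpleCache {
    async fn get_or_load<F, Fut>(&self, path: &str, loader: F) -> Result<Arc<Metadata>, String>
    where
        F: FnOnce() -> Fut,
        Fut: Future<Output = Result<Arc<Metadata>, String>>,
    {
        // The lock guard is dropped at the end of this statement, before any await.
        let hit = self.entries.lock().unwrap().get(path).cloned();
        if let Some(meta) = hit {
            return Ok(meta);
        }
        let loaded = loader().await?;
        self.entries
            .lock()
            .unwrap()
            .insert(path.to_string(), Arc::clone(&loaded));
        Ok(loaded)
    }
}

/// Toy executor: polls in a loop. Only suitable for futures that never
/// actually wait on I/O, as in this example.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
        std::thread::yield_now();
    }
}

/// Looks up the same path twice and returns how many times the loader ran.
fn loader_calls_for_two_lookups() -> usize {
    let cache = SimpleCache::default();
    let calls = AtomicUsize::new(0);
    let counter = &calls;
    for _ in 0..2 {
        let meta = block_on(cache.get_or_load("a.parquet", move || async move {
            counter.fetch_add(1, Ordering::SeqCst);
            Ok::<_, String>(Arc::new(Metadata { num_rows: 42 }))
        }))
        .unwrap();
        assert_eq!(meta.num_rows, 42);
    }
    calls.load(Ordering::SeqCst)
}
```

A downstream crate that needs singleflight would implement the same trait on top of moka or a tokio OnceCell, so that concurrent misses for one path share a single loader call. The call site in fetch_metadata would not need to change.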
