sgrebnov commented on issue #12779: URL: https://github.com/apache/datafusion/issues/12779#issuecomment-2401494063
This is very cool. I wanted to share a few ideas/concepts that worked well for [spiceai](https://github.com/spiceai/spiceai), which recently added simple caching using DataFusion with a slightly different approach: without solving the delta problem between prefetched/cached and actual data. In most cases we control dataset updates and just perform cache invalidation; for the other cases, responding with outdated information within a configurable TTL is expected behavior.

1. The cache is keyed on the root `LogicalPlan`: the query is first transformed into a logical plan, which is then used as the cache key (see the first sketch after this list). We used the root logical plan, but I imagine this could be generalized to the execution plan level, returning cached items instead of actually executing. That would work well where predicate pushdown is not fully supported, or where Parquet-encoded statistics are unavailable, so the same executions/inputs repeat even across different queries.
2. I like the idea of not having opinions about where the cached data is stored; for us, [moka](https://docs.rs/moka/latest/moka/) worked best after comparing it with a few other libraries on performance and other criteria.
3. A configurable maximum cache size worked well for us. We operate on streams, so we simply [wrap the response record-batch stream](https://github.com/spiceai/spiceai/blob/trunk/crates/cache/src/utils.rs#L33) and cache records until the response grows too large to be worth caching. The total size limit is enforced by assigning weights (the actual size) to cached items, which is built-in moka functionality.
4. Our cache-invalidation approach also worked well: tracking the input datasets of each cache entry, derived from the [logical plan information](https://github.com/spiceai/spiceai/blob/45532b1fd73936586aed1085a07f81061f767947/crates/cache/src/utils.rs#L78), allows simple invalidation when a dataset is updated (see the second sketch below). We do this whenever we update the local (materialized) dataset copy.
5. The eviction algorithm for cached items should be independent, IMO. We use LRU plus a configurable TTL.

Cache implementation: https://github.com/spiceai/spiceai/blob/trunk/crates/cache/src/lru_cache.rs
Usage example: https://github.com/spiceai/spiceai/blob/trunk/crates/runtime/src/datafusion/query.rs#L167
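
To make items 1–3 and 5 concrete, here is a minimal sketch (not Spice's actual code) of a moka cache keyed on the root `LogicalPlan`, with per-item byte weights and a TTL. The key derivation via `display_indent()` and the `CachedResult` shape are illustrative assumptions:

```rust
use std::{sync::Arc, time::Duration};

use datafusion::arrow::record_batch::RecordBatch;
use datafusion::logical_expr::LogicalPlan;
use moka::sync::Cache;

/// Illustrative cached value: the fully collected result batches of a query.
type CachedResult = Arc<Vec<RecordBatch>>;

/// Build a cache whose capacity is a total byte budget rather than an
/// entry count, with a TTL bounding how stale a response may get.
fn build_results_cache(max_bytes: u64, ttl: Duration) -> Cache<String, CachedResult> {
    Cache::builder()
        .max_capacity(max_bytes)
        // Weigh each entry by the actual memory footprint of its batches,
        // mirroring the "weights = actual size" idea above.
        .weigher(|_key: &String, batches: &CachedResult| {
            batches
                .iter()
                .map(|b| b.get_array_memory_size())
                .sum::<usize>()
                .try_into()
                .unwrap_or(u32::MAX)
        })
        .time_to_live(ttl)
        .build()
}

/// Derive a cache key from the root logical plan. Using the plan's display
/// form is an assumption for this sketch; hashing the plan directly is
/// another option, since `LogicalPlan` implements `Hash`.
fn plan_cache_key(plan: &LogicalPlan) -> String {
    plan.display_indent().to_string()
}
```

A query path would then check `cache.get(&plan_cache_key(&plan))` before executing and insert the collected batches afterwards.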

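For the invalidation side (item 4), a sketch of collecting input dataset names by walking the plan for `TableScan` nodes, assuming a recent DataFusion `TreeNode` API (the visitor signature has changed across versions):

```rust
use std::collections::HashSet;

use datafusion::common::tree_node::{TreeNode, TreeNodeRecursion};
use datafusion::error::Result;
use datafusion::logical_expr::LogicalPlan;

/// Collect the names of all tables a plan reads from, so each cache entry
/// can be tagged with its input datasets and invalidated when one changes.
fn referenced_tables(plan: &LogicalPlan) -> Result<HashSet<String>> {
    let mut tables = HashSet::new();
    plan.apply(|node| {
        if let LogicalPlan::TableScan(scan) = node {
            tables.insert(scan.table_name.to_string());
        }
        Ok(TreeNodeRecursion::Continue)
    })?;
    Ok(tables)
}
```

Each cache entry can then be tagged with this set, and a dataset-update hook can invalidate every entry whose set contains the refreshed dataset.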