sgrebnov commented on issue #12779: URL: https://github.com/apache/datafusion/issues/12779#issuecomment-2401494063
This is very cool. I wanted to share a few ideas/concepts that worked well for [spiceai](https://github.com/spiceai/spiceai), which recently added simple caching using DataFusion with a slightly different approach: without solving the delta problem between prefetched/cached and actual data. In most cases we control dataset updates and just perform cache invalidation; for the other cases, responding with outdated information within a configurable TTL is expected behavior.

1. The cache is keyed on the root `LogicalPlan`: the query is first transformed into a logical plan, which is then used as the cache key (see the first sketch after this list). We used the root logical plan, but I imagine this could be generalized to the execution plan level, returning cached items instead of actually executing. That would work well where predicate pushdown is not fully supported, or where Parquet-encoded statistics are unavailable, so the same executions/inputs repeat even across different queries.
2. I like the idea of not having opinions about where the cached data is stored; for us, [moka](https://docs.rs/moka/latest/moka/) worked best after comparing it with a few other libraries on performance and other criteria.
3. A configurable maximum cache size worked well for us. We operate on streams, so we simply [wrap the response record-batch stream](https://github.com/spiceai/spiceai/blob/trunk/crates/cache/src/utils.rs#L33) and cache records until the response grows too large to be worth caching. The total size limit is enforced by assigning weights (the actual size) to cached items, which is built-in moka functionality.
4. Our cache-invalidation approach also worked well: tracking the input datasets of each cache entry, derived from the [logical plan information](https://github.com/spiceai/spiceai/blob/45532b1fd73936586aed1085a07f81061f767947/crates/cache/src/utils.rs#L78), allows simple invalidation when a dataset is updated (see the second sketch below). We do this whenever we update the local (materialized) dataset copy.
5. The eviction algorithm for cached items should be independent, IMO. We use LRU plus a configurable TTL.

Cache implementation: https://github.com/spiceai/spiceai/blob/trunk/crates/cache/src/lru_cache.rs
Usage example: https://github.com/spiceai/spiceai/blob/trunk/crates/runtime/src/datafusion/query.rs#L167
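
To make items 1–3 and 5 concrete, here is a minimal sketch (not Spice's actual code) of a moka cache keyed on the root `LogicalPlan`, with per-item byte weights and a TTL. The key derivation via `display_indent()` and the `CachedResult` shape are illustrative assumptions:

```rust
use std::{sync::Arc, time::Duration};

use datafusion::arrow::record_batch::RecordBatch;
use datafusion::logical_expr::LogicalPlan;
use moka::sync::Cache;

/// Illustrative cached value: the fully collected result batches of a query.
type CachedResult = Arc<Vec<RecordBatch>>;

/// Build a cache whose capacity is a total byte budget rather than an
/// entry count, with a TTL bounding how stale a response may get.
fn build_results_cache(max_bytes: u64, ttl: Duration) -> Cache<String, CachedResult> {
    Cache::builder()
        .max_capacity(max_bytes)
        // Weigh each entry by the actual memory footprint of its batches,
        // mirroring the "weights = actual size" idea above.
        .weigher(|_key: &String, batches: &CachedResult| {
            batches
                .iter()
                .map(|b| b.get_array_memory_size())
                .sum::<usize>()
                .try_into()
                .unwrap_or(u32::MAX)
        })
        .time_to_live(ttl)
        .build()
}

/// Derive a cache key from the root logical plan. Using the plan's display
/// form is an assumption for this sketch; hashing the plan directly is
/// another option, since `LogicalPlan` implements `Hash`.
fn plan_cache_key(plan: &LogicalPlan) -> String {
    plan.display_indent().to_string()
}
```

A query path would then check `cache.get(&plan_cache_key(&plan))` before executing and insert the collected batches afterwards.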

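For the invalidation side (item 4), a sketch of collecting input dataset names by walking the plan for `TableScan` nodes, assuming a recent DataFusion `TreeNode` API (the visitor signature has changed across versions):

```rust
use std::collections::HashSet;

use datafusion::common::tree_node::{TreeNode, TreeNodeRecursion};
use datafusion::error::Result;
use datafusion::logical_expr::LogicalPlan;

/// Collect the names of all tables a plan reads from, so each cache entry
/// can be tagged with its input datasets and invalidated when one changes.
fn referenced_tables(plan: &LogicalPlan) -> Result<HashSet<String>> {
    let mut tables = HashSet::new();
    plan.apply(|node| {
        if let LogicalPlan::TableScan(scan) = node {
            tables.insert(scan.table_name.to_string());
        }
        Ok(TreeNodeRecursion::Continue)
    })?;
    Ok(tables)
}
```

Each cache entry can then be tagged with this set, and a dataset-update hook can invalidate every entry whose set contains the refreshed dataset.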