bart-samwel commented on PR #41777: URL: https://github.com/apache/spark/pull/41777#issuecomment-1613438677

> > @ryan-johnson-databricks I think it's a good idea to extend the current cache API to allow users to customize the refresh policy per query cache. We can explore it later. But a session config should still be useful to disable auto-refresh for all INSERTs in the session.
>
> This is complicated... presumably it's readers who do (or do not) want stable caching... but the writers are the ones who have to decide whether to refresh or not. And they may have no idea what the reader wants. Conversely, the reader may have no control over the spark confs a writer uses...

If you allow the user of `cache()` to specify whether they allow staleness, they run the risk of making _other users_ of the same cluster see stale results. Users who did not opt in to stale results themselves. That seems like a highly undesirable outcome.

If you allow sessions to indicate that _their writes_ won't cause caches to refresh, then:

* To *other users* of the same cluster, the behavior changes to be the same as if the workload was running on another cluster. I.e., if all workloads were running on separate clusters, then they would not cause each other's caches to refresh. This basically restores that behavior. That seems OK.
* To *the same session* on the same cluster, it causes cached results to be not refreshed. Is that a problem? I think it is a problem. If someone else on the same cluster called `cache()` on a particular dataframe (without the special setting), and this session did not, then this session will still not refresh the cached dataframe _even though it did not ask for it to be cached_.

I.e., something like this:

1. Session 1 caches a dataframe that reads table T.
2. Session 2 sets the config "don't auto refresh cache based on my writes".
3. Session 2 writes something to table T.
4. Session 2 creates a dataframe on table T and reads it. IT DOES NOT SEE ITS OWN CHANGES.

So setting such a config may change the outcome of (4) even if you did not cache _anything_, just because somebody else on the cluster cached something.
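
To make the four-step sequence concrete, here is a toy model in plain Python. It is not Spark code: the `Cluster` and `Session` classes and the `refresh_on_write` flag are hypothetical stand-ins for the shared cache and the proposed session config. It also assumes that any session's read is served from the shared cache whenever an entry exists.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for a cluster with one cache shared by all sessions."""
    tables: dict = field(default_factory=dict)  # table name -> list of rows
    cache: dict = field(default_factory=dict)   # table name -> cached snapshot

    def session(self, refresh_on_write=True):
        return Session(self, refresh_on_write)


@dataclass
class Session:
    cluster: Cluster
    # Hypothetical config: "don't auto refresh cache based on my writes"
    # corresponds to refresh_on_write=False.
    refresh_on_write: bool = True

    def cache_table(self, name):
        # Snapshot the table into the cache shared by the whole cluster.
        self.cluster.cache[name] = list(self.cluster.tables.get(name, []))

    def insert(self, name, row):
        self.cluster.tables.setdefault(name, []).append(row)
        # Refresh an existing cache entry only if this session allows it.
        if self.refresh_on_write and name in self.cluster.cache:
            self.cluster.cache[name] = list(self.cluster.tables[name])

    def read(self, name):
        # Assumption: reads use the shared cache entry when one exists,
        # regardless of which session created it.
        if name in self.cluster.cache:
            return list(self.cluster.cache[name])
        return list(self.cluster.tables.get(name, []))


def run_scenario(refresh_on_write):
    cluster = Cluster(tables={"T": ["old"]})
    s1 = cluster.session()
    s2 = cluster.session(refresh_on_write=refresh_on_write)  # step 2
    s1.cache_table("T")      # step 1: session 1 caches a read of T
    s2.insert("T", "new")    # step 3: session 2 writes to T
    return s2.read("T")      # step 4: session 2 reads T
```

In this model `run_scenario(True)` returns both rows, while `run_scenario(False)` returns only the old row: session 2 never called `cache_table`, yet it misses its own write because session 1's cache entry was left stale.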
