[GitHub] [spark] bart-samwel commented on pull request #41777: [SPARK-44232][SQL] add a config for auto refresh query result cache

via GitHub Thu, 29 Jun 2023 08:45:09 -0700


bart-samwel commented on PR #41777:
URL: https://github.com/apache/spark/pull/41777#issuecomment-1613438677


   > > @ryan-johnson-databricks I think it's a good idea to extend the current 
cache API to allow users to customize the refresh policy per query cache. We 
can explore it later. But a session config should still be useful to disable 
auto-refresh for all INSERTs in the session.
   > 
   > This is complicated... presumably it's readers who do (or do not) want 
stable caching... but the writers are the ones who have to decide whether to 
refresh or not. And they may have no idea what the reader wants. Conversely, 
the reader may have no control over the spark confs a writer uses...
   
   If you allow the user of `cache()` to specify whether they allow staleness, 
they run the risk of making _other users_ of the same cluster see stale 
results. Users who did not opt in to stale results themselves. That seems like 
a highly undesirable outcome.
   
   If you allow sessions to indicate that _their writes_ won't cause caches to 
refresh, then:
   * To *other users* of the same cluster, the behavior changes to be the same 
as if the workload was running on another cluster. I.e., if all workloads were 
running on separate clusters, then they would not cause each other's caches to 
refresh. This basically restores that behavior. That seems OK.
   * To *the same session* on the same cluster, it causes cached results not 
to be refreshed. Is that a problem? I think it is. If someone else on the same 
cluster called `cache()` on a particular dataframe (without the special 
setting), and this session did not, then this session's writes will still not 
refresh the cached dataframe _even though this session did not ask for it to 
be cached_.
   
   I.e., something like this:
   1. Session 1 caches a dataframe that reads table T.
   2. Session 2 sets the config "don't auto refresh cache based on my writes".
   3. Session 2 writes something to table T.
   4. Session 2 creates a dataframe on table T and reads it. IT DOES NOT SEE 
ITS OWN CHANGES.
   
   So setting such a config may change the outcome of (4) even if you did not 
cache _anything_, just because somebody else on the cluster cached something.
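
To make the four steps above concrete, here is a toy model in plain Python. It is only a sketch: `Cluster`, `Session`, `auto_refresh_on_write` and `run_scenario` are hypothetical names invented for illustration and do not correspond to Spark's cache manager or to the config proposed in this PR. The model assumes one cache shared by every session on the cluster, and that any read of a cached table is served from that cache.

```python
# Toy model only: these classes are hypothetical and do not mirror any real
# Spark API. They just encode the assumptions stated above.


class Cluster:
    """Shared state: table contents plus one cluster-wide cache."""

    def __init__(self):
        self.tables = {}  # table name -> list of rows
        self.cache = {}   # table name -> cached snapshot of its rows


class Session:
    def __init__(self, cluster, auto_refresh_on_write=True):
        self.cluster = cluster
        # Hypothetical stand-in for "auto refresh caches based on my writes".
        self.auto_refresh_on_write = auto_refresh_on_write

    def cache_table(self, name):
        # Snapshot the current rows into the shared cache.
        self.cluster.cache[name] = list(self.cluster.tables.get(name, []))

    def insert(self, name, row):
        self.cluster.tables.setdefault(name, []).append(row)
        # The writer decides whether the shared cache entry is refreshed.
        if self.auto_refresh_on_write and name in self.cluster.cache:
            self.cluster.cache[name] = list(self.cluster.tables[name])

    def read(self, name):
        # Every session is served from the shared cache when an entry exists,
        # whether or not it was the one that asked for caching.
        if name in self.cluster.cache:
            return list(self.cluster.cache[name])
        return list(self.cluster.tables.get(name, []))


def run_scenario(writer_auto_refresh):
    cluster = Cluster()
    cluster.tables["T"] = ["old"]
    session1 = Session(cluster)
    # Step 2: session 2 chooses its write-refresh behavior.
    session2 = Session(cluster, auto_refresh_on_write=writer_auto_refresh)
    session1.cache_table("T")     # step 1
    session2.insert("T", "new")   # step 3
    return session2.read("T")     # step 4


print(run_scenario(writer_auto_refresh=True))   # ['old', 'new']
print(run_scenario(writer_auto_refresh=False))  # ['old']
```

In the second run, session 2 never cached anything itself, yet its read in step 4 misses the row it wrote in step 3, because session 1's cache entry was left stale.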


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

