ahshahid commented on PR #49150: URL: https://github.com/apache/spark/pull/49150#issuecomment-2536944818
Previous chain of comments:

From @anchovYu:

Hi @ahshahid, thanks for the proposal and the PR. However, the current DataFrame cache design has a number of design flaws, and I worry that improving the cache hit rate in this case will make these problems worse:

- **Stale results**: the current DataFrame cache design cannot proactively detect data staleness. Consider the case where the table is changed by some other application — the DataFrame cache never knows that, and it leads to outdated results. Increasing the hit rate therefore potentially increases the risk of hitting stale data.
- **Unstable performance**: since the DataFrame cache is shared across Spark sessions and applied automatically, one session could use data cached by another session on the same driver, or lose the ability to use cached data when another session on the same driver invalidates it, all without any notice. This results in unpredictable performance even for two consecutive executions of the same query in one session on the same driver. Similarly, increasing the hit rate may make such unstable performance easier to encounter.

cc @cloud-fan

From @ahshahid:

@anchovYu I see what you are saying. This PR actually arose as a sub-PR of [SPARK-45959](https://issues.apache.org/jira/browse/SPARK-45959); the PR for that issue is https://github.com/apache/spark/pull/43854. The increase in cache hits is a by-product of that PR. In my experience, and in current customer use cases, the issue tracked in [SPARK-45959](https://issues.apache.org/jira/browse/SPARK-45959) has caused compile times to grow from under a minute to anywhere between 2 and 8+ hours.

Whether the stale-cache and related issues get worse would very much depend on the nature of the cached DataFrame. Also, most of the time a user caches a DataFrame and then keeps building new DataFrames on top of that base, so even today there is no way to prevent new DataFrames from using cached data when a sub-plan matches. And whatever fix goes in for the stale-cache resolution issue would automatically apply to this change too. So please consider this PR not in the light of increasing the cache hit rate per se, but as a prerequisite for resolving [SPARK-45959](https://issues.apache.org/jira/browse/SPARK-45959).
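For readers unfamiliar with the behavior both comments describe, here is a minimal, hypothetical sketch of the pattern being discussed: a cached DataFrame whose sub-plan is automatically reused by DataFrames derived from it, and where the staleness risk enters. The table name `t` and columns `x`, `y` are placeholders, not taken from this PR.

```scala
import org.apache.spark.sql.SparkSession

object CacheReuseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-sketch").getOrCreate()
    import spark.implicits._

    // Cache a base DataFrame; the first action materializes it.
    val base = spark.read.table("t").filter($"x" > 0)
    base.cache()
    base.count()

    // A new DataFrame built on top of the cached one: Spark's
    // CacheManager matches the cached sub-plan, so the in-memory
    // data is reused automatically. There is no per-query way to
    // opt a derived plan out of this reuse.
    val derived = base.groupBy($"y").count()
    derived.collect()

    // Staleness risk from the first comment: if another application
    // overwrites table "t" now, this session's cache is not invalidated,
    // so later reads of `base` or `derived` may return outdated rows
    // until base.unpersist() or spark.catalog.refreshTable("t") is called.
    spark.stop()
  }
}
```

This illustrates why the reply argues the concern already exists today independently of this PR: once a base DataFrame is cached, any derived DataFrame with a matching sub-plan picks up the cached data, so a fix for cache staleness would apply to the PR's increased matching as well.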
