ahshahid commented on PR #49150: URL: https://github.com/apache/spark/pull/49150#issuecomment-2536944818
Previous chain of comments:

From @anchovYu:

Hi @ahshahid, thanks for the proposal and the PR. However, the current DataFrame cache design has a number of design flaws, and I worry that improving the cache hit rate in this case will make these problems worse:

- **Stale results**: the current DataFrame cache design cannot proactively detect data staleness. Consider the case where the table is changed by some other application — the DataFrame cache never knows that, and it leads to outdated results. Increasing the hit rate therefore potentially increases the risk of hitting stale data.
- **Unstable performance**: since the DataFrame cache is shared across Spark sessions and applied automatically, one session could use data cached by another session on the same driver, or lose the ability to use cached data when another session on the same driver invalidates it, all without any notice. This results in unpredictable performance even for two consecutive executions of the same query in one session on the same driver. Similarly, increasing the hit rate may make such unstable performance easier to encounter.

cc @cloud-fan

From @ahshahid:

@anchovYu I see what you are saying. This PR actually arose as a sub-PR of [SPARK-45959](https://issues.apache.org/jira/browse/SPARK-45959); the PR for that issue is https://github.com/apache/spark/pull/43854. The increase in cache hits is a by-product of that PR. In my experience, and in current customer use cases, the issue tracked in [SPARK-45959](https://issues.apache.org/jira/browse/SPARK-45959) has caused compile times to grow from under a minute to anywhere between 2 and 8+ hours.

Whether the stale-cache and related issues get worse would very much depend on the nature of the cached DataFrame. Also, most of the time a user caches a DataFrame and then keeps building new DataFrames on top of that base, so even today there is no way to prevent new DataFrames from using cached data when a sub-plan matches. And whatever fix goes in for the stale-cache resolution issue would automatically apply to this change too. So please consider this PR not in the light of increasing the cache hit rate per se, but as a prerequisite for resolving [SPARK-45959](https://issues.apache.org/jira/browse/SPARK-45959).
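For readers unfamiliar with the behavior both comments describe, here is a minimal, hypothetical sketch of the pattern being discussed: a cached DataFrame whose sub-plan is automatically reused by DataFrames derived from it, and where the staleness risk enters. The table name `t` and columns `x`, `y` are placeholders, not taken from this PR.

```scala
import org.apache.spark.sql.SparkSession

object CacheReuseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-sketch").getOrCreate()
    import spark.implicits._

    // Cache a base DataFrame; the first action materializes it.
    val base = spark.read.table("t").filter($"x" > 0)
    base.cache()
    base.count()

    // A new DataFrame built on top of the cached one: Spark's
    // CacheManager matches the cached sub-plan, so the in-memory
    // data is reused automatically. There is no per-query way to
    // opt a derived plan out of this reuse.
    val derived = base.groupBy($"y").count()
    derived.collect()

    // Staleness risk from the first comment: if another application
    // overwrites table "t" now, this session's cache is not invalidated,
    // so later reads of `base` or `derived` may return outdated rows
    // until base.unpersist() or spark.catalog.refreshTable("t") is called.
    spark.stop()
  }
}
```

This illustrates why the reply argues the concern already exists today independently of this PR: once a base DataFrame is cached, any derived DataFrame with a matching sub-plan picks up the cached data, so a fix for cache staleness would apply to the PR's increased matching as well.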
