anchovYu commented on PR #45935:
URL: https://github.com/apache/spark/pull/45935#issuecomment-2048878434
Hi @ahshahid , thanks for the proposal and the PR. However, the current
Dataframe cache design has several flaws, and I worry that improving the cache
hit rate in this case would make these problems worse:
* stale results
The current design of the Dataframe cache cannot proactively detect data
staleness. Consider the case where the table is changed by some other
application - the Dataframe cache never learns of the change and keeps
serving outdated results. In this case, increasing the hit rate potentially
increases the risk of hitting stale data.
* unstable performance
Since the Dataframe cache is shared across Spark sessions and applied
automatically, one session may use data cached by another session on the same
driver, or lose access to cached data when another session on the same driver
invalidates it, all without any notice. This results in unpredictable
performance even for two consecutive executions of the same query within one
session. Similarly, increasing the hit rate makes it easier to run into such
unstable performance issues.
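To make the staleness point concrete, here is a plain-Python sketch (my own illustration, not Spark internals) of a plan-keyed cache that has no way to observe writes made by another application, so it keeps serving the rows it materialized first:

```python
# Illustrative sketch: a cache keyed by the query plan text. Once a plan is
# cached, later executions never re-read the table, so external changes to
# the table are invisible to cached queries.
class ToyDataframeCache:
    def __init__(self):
        self._cache = {}  # plan text -> materialized rows

    def execute(self, plan, table):
        if plan in self._cache:
            # Cache hit: return previously materialized rows as-is.
            return self._cache[plan]
        rows = list(table)  # "scan" the table and materialize the result
        self._cache[plan] = rows
        return rows


table = [1, 2, 3]
cache = ToyDataframeCache()
first = cache.execute("SELECT * FROM t", table)

table.append(4)  # another application changes the table; the cache never knows
second = cache.execute("SELECT * FROM t", table)
assert second == [1, 2, 3]  # stale: the newly appended row is invisible
```

The higher the hit rate, the more often the second, stale path is taken instead of a fresh scan.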
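The cross-session performance point can be sketched the same way (again plain Python, only an illustration of the shared-cache behavior described above): two sessions share one driver-wide cache, so one session's uncache silently changes whether another session's identical query hits the cache or recomputes:

```python
# Illustrative sketch: a driver-wide cache shared by all sessions. Session B
# invalidating an entry changes session A's execution path without notice.
class SharedCache:
    def __init__(self):
        self.entries = {}  # plan text -> materialized rows


class Session:
    def __init__(self, shared):
        self.shared = shared

    def run(self, plan, compute):
        hit = self.shared.entries.get(plan)
        if hit is not None:
            return hit, "cache hit"
        rows = compute()
        self.shared.entries[plan] = rows
        return rows, "recomputed"

    def uncache(self, plan):
        self.shared.entries.pop(plan, None)


driver_cache = SharedCache()
a, b = Session(driver_cache), Session(driver_cache)

a.run("q", lambda: [1, 2])
_, how1 = a.run("q", lambda: [1, 2])  # second run in session A: cache hit
b.uncache("q")                        # session B invalidates the shared entry
_, how2 = a.run("q", lambda: [1, 2])  # same query in A, now recomputed
assert (how1, how2) == ("cache hit", "recomputed")
```

From session A's point of view, two consecutive executions of the same query perform very differently, and nothing in A explains why.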
cc @cloud-fan
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]