anchovYu commented on PR #45935:
URL: https://github.com/apache/spark/pull/45935#issuecomment-2048878434

   Hi @ahshahid , thanks for the proposal and the PR. However, the current 
Dataframe cache design has a number of flaws, and I worry that improving the 
cache hit rate in this case will make these problems worse:
   
   * stale results
     The current design of the Dataframe cache cannot proactively detect data 
staleness. Consider the case where the table is changed by another 
application: the Dataframe cache never learns of the change, which leads to 
outdated results. Increasing the hit rate therefore also increases the risk of 
reading stale data.
   * unstable performance
     Since the Dataframe cache is shared across Spark sessions and applied 
automatically, one session may use data cached by another session on the same 
driver, or lose access to cached data when another session on the same driver 
invalidates it, all without any notice. This results in unpredictable 
performance even for two consecutive executions of the same query within one 
session. Similarly, increasing the hit rate makes such instability easier to 
encounter.
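   
   To make the two hazards concrete, here is a minimal sketch of a 
plan-keyed cache shared across sessions. All names are hypothetical, and it 
deliberately does not use Spark's actual CacheManager API; it only models the 
behavior described above: the cache never observes external table writes, and 
any session can evict entries that other sessions rely on.
   
   ```python
   # Toy model of a cache shared across sessions, keyed by logical plan.
   # Hypothetical names; NOT Spark's real CacheManager API.
   class SharedCache:
       def __init__(self):
           self._entries = {}  # plan key -> materialized result
   
       def get_or_compute(self, plan_key, compute):
           if plan_key in self._entries:
               return self._entries[plan_key]  # hit: possibly stale
           result = compute()
           self._entries[plan_key] = result
           return result
   
       def invalidate(self, plan_key):
           self._entries.pop(plan_key, None)   # any session can evict
   
   
   # Backing "table" that an external application can mutate.
   table = [1, 2, 3]
   cache = SharedCache()
   
   # Session A caches the result of sum(table).
   first = cache.get_or_compute("sum(table)", lambda: sum(table))
   
   # An external writer updates the table; the cache never observes this.
   table.append(4)
   
   # Session B reuses A's entry and gets a stale result (6, not 10).
   second = cache.get_or_compute("sum(table)", lambda: sum(table))
   
   # Session B invalidates the entry; the next run silently recomputes,
   # so the same query is now slower but fresh -- unpredictable performance.
   cache.invalidate("sum(table)")
   third = cache.get_or_compute("sum(table)", lambda: sum(table))
   print(first, second, third)  # 6 6 10
   ```
   
   A higher hit rate in this model simply means more reads land on the 
stale entry before anyone invalidates it.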
   
   cc @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

