maryannxue commented on a change in pull request #23644: [SPARK-26708][SQL]
Incorrect result caused by inconsistency between a SQL cache's cached RDD and
its physical plan
URL: https://github.com/apache/spark/pull/23644#discussion_r251634750
##########
File path:
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
##########
@@ -180,7 +180,26 @@ class CacheManager extends Logging {
val it = cachedData.iterator()
while (it.hasNext) {
val cd = it.next()
- if (condition(cd.plan)) {
+ // If `clearCache` is false (which means the recache request comes from a non-cascading
+ // cache invalidation) and the cache buffer has already been loaded, we do not need to
+ // re-compile a physical plan because the old plan will not be used any more by the
+ // CacheManager although it still lives in compiled `Dataset`s and it could still work.
Review comment:
The only way an old plan could get used by the CacheManager is for someone to
call `CachedRDDBuilder.clearCache()` without removing the entry from the
CacheManager, and we have no interface that can lead to that scenario. Even if
the old plan did get used, it wouldn't cause any serious issue other than the
inefficiency of loading an "unmanaged" RDD cache, which can already happen with
a plain DataFrame whose outdated plan holds a reference to a cache that has
since been uncached.
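
To make the "unmanaged" case concrete, here is a toy sketch of the scenario.
It is not Spark code: `ToyCacheBuilder`, `ToyCacheManager`, and
`StalePlanDemo` are hypothetical stand-ins that assume only that a cache
builder materializes its buffer lazily and that an already-compiled plan keeps
a direct reference to the builder.

```scala
// Toy model only: these classes are hypothetical stand-ins, not Spark's real API.
import scala.collection.mutable

// Stands in for a builder that lazily materializes a cached buffer.
final class ToyCacheBuilder(compute: () => Seq[Int]) {
  private var buffer: Option[Seq[Int]] = None
  var loads: Int = 0 // how many times the buffer was (re)built

  def cachedRows: Seq[Int] = buffer.getOrElse {
    loads += 1
    val rows = compute()
    buffer = Some(rows)
    rows
  }

  def clearCache(): Unit = buffer = None
}

// Stands in for the cache manager: it only tracks builders registered with it.
final class ToyCacheManager {
  private val entries = mutable.Map.empty[String, ToyCacheBuilder]
  def register(key: String, b: ToyCacheBuilder): Unit = entries(key) = b
  // Removes the entry and clears its buffer, like an uncache request.
  def uncache(key: String): Unit = entries.remove(key).foreach(_.clearCache())
  def lookup(key: String): Option[ToyCacheBuilder] = entries.get(key)
}

object StalePlanDemo {
  def run(): (Seq[Int], Int, Boolean) = {
    val manager = new ToyCacheManager
    val builder = new ToyCacheBuilder(() => Seq(1, 2, 3))
    manager.register("t", builder)

    // An already-compiled "plan" keeps a direct reference to the builder.
    val stalePlan: () => Seq[Int] = () => builder.cachedRows
    stalePlan() // first load of the buffer

    manager.uncache("t") // entry removed from the manager, buffer cleared

    // The stale plan still returns correct rows, but it has to rebuild a
    // buffer that the manager no longer tracks.
    val rows = stalePlan()
    (rows, builder.loads, manager.lookup("t").isDefined)
  }
}
```

In this toy model the stale plan still returns the right rows. The only cost
is the second load, into a buffer the manager no longer knows about.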