Github user maryannxue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21594#discussion_r196899266
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -107,22 +107,35 @@ class CacheManager extends Logging {
   /**
    * Un-cache all the cache entries that refer to the given plan.
    */
-  def uncacheQuery(query: Dataset[_], blocking: Boolean = true): Unit = writeLock {
-    uncacheQuery(query.sparkSession, query.logicalPlan, blocking)
+  def uncacheQuery(query: Dataset[_],
+      cascade: Boolean, blocking: Boolean = true): Unit = writeLock {
+    uncacheQuery(query.sparkSession, query.logicalPlan, cascade, blocking)
   }

   /**
    * Un-cache all the cache entries that refer to the given plan.
    */
-  def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
+  def uncacheQuery(spark: SparkSession, plan: LogicalPlan,
+      cascade: Boolean, blocking: Boolean): Unit = writeLock {
     val it = cachedData.iterator()
+    val needToRecache = scala.collection.mutable.ArrayBuffer.empty[CachedData]
     while (it.hasNext) {
       val cd = it.next()
       if (cd.plan.find(_.sameResult(plan)).isDefined) {
-        cd.cachedRepresentation.cacheBuilder.clearCache(blocking)
         it.remove()
+        if (cascade || cd.plan.sameResult(plan)) {
+          cd.cachedRepresentation.cacheBuilder.clearCache(blocking)
+        } else {
+          val plan = spark.sessionState.executePlan(cd.plan).executedPlan
+          val newCache = InMemoryRelation(
--- End diff --
--- End diff --
Yes, you are right, although it wouldn't lead to any error, just like all
other compiled DataFrames that refer to this old InMemoryRelation. I'll change
this piece of code. But you've brought up another interesting question.
Consider a scenario similar to what you've mentioned:
```scala
df1 = ...
df2 = df1.filter(...)
df2.cache()
df1.cache()
df1.collect()
```
Here we cache the dependent Dataset (df2) first and the Dataset it
depends on (df1) next. Optimally, when you call df2.collect(), you would like
df2 to use the cached data in df1, but it doesn't work like this now, since
df2's execution plan has already been generated before we call df1.cache(). It
might be worth revisiting the caches and updating their plans if necessary when
we call cacheQuery().
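To make the ordering issue concrete, here is a toy sketch (all names here --
`ToyCacheManager`, `Plan`, `compile`, `InMemoryScan` -- are made up for
illustration and are not Spark's actual internals): a plan is "compiled" once,
and at compile time it can only substitute caches that already exist, which is
why caching df2 before df1 leaves df2's compiled plan blind to df1's cache.

```scala
// Toy model of cache substitution at plan-compile time.
// A Plan is just a list of operator names, source last.
object CacheOrderSketch {
  case class Plan(ops: List[String])

  class ToyCacheManager {
    private var cached = List.empty[Plan]

    def cache(plan: Plan): Unit = cached = plan :: cached

    // "Compile" a plan: if a suffix of its operators matches a cached plan,
    // replace that suffix with a marker standing in for an InMemoryRelation
    // scan. This happens once, when the plan is generated.
    def compile(plan: Plan): Plan =
      cached.find(c => plan.ops.endsWith(c.ops)) match {
        case Some(c) =>
          Plan(plan.ops.dropRight(c.ops.length) :+
            s"InMemoryScan(${c.ops.mkString(",")})")
        case None => plan
      }
  }

  def main(args: Array[String]): Unit = {
    val df1 = Plan(List("scan"))
    val df2 = Plan(List("filter", "scan"))

    // Order from the scenario above: df2's plan is generated before
    // df1.cache(), so it cannot reference df1's cache.
    val mgr = new ToyCacheManager
    val compiledDf2 = mgr.compile(df2)
    mgr.cache(df1)
    println(compiledDf2.ops) // List(filter, scan) -- df1's cache unused

    // Reverse order: cache df1 first, then compile df2.
    val mgr2 = new ToyCacheManager
    mgr2.cache(df1)
    println(mgr2.compile(df2).ops) // List(filter, InMemoryScan(scan))
  }
}
```

Revisiting cached entries inside cacheQuery() would amount to re-running the
substitution step above for already-compiled plans whenever a new cache is
registered.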
---