Github user maryannxue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21594#discussion_r196899266
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala ---
@@ -107,22 +107,35 @@ class CacheManager extends Logging {
   /**
    * Un-cache all the cache entries that refer to the given plan.
    */
-  def uncacheQuery(query: Dataset[_], blocking: Boolean = true): Unit = writeLock {
-    uncacheQuery(query.sparkSession, query.logicalPlan, blocking)
+  def uncacheQuery(query: Dataset[_],
+      cascade: Boolean, blocking: Boolean = true): Unit = writeLock {
+    uncacheQuery(query.sparkSession, query.logicalPlan, cascade, blocking)
   }

   /**
    * Un-cache all the cache entries that refer to the given plan.
    */
-  def uncacheQuery(spark: SparkSession, plan: LogicalPlan, blocking: Boolean): Unit = writeLock {
+  def uncacheQuery(spark: SparkSession, plan: LogicalPlan,
+      cascade: Boolean, blocking: Boolean): Unit = writeLock {
     val it = cachedData.iterator()
+    val needToRecache = scala.collection.mutable.ArrayBuffer.empty[CachedData]
     while (it.hasNext) {
       val cd = it.next()
       if (cd.plan.find(_.sameResult(plan)).isDefined) {
-        cd.cachedRepresentation.cacheBuilder.clearCache(blocking)
         it.remove()
+        if (cascade || cd.plan.sameResult(plan)) {
+          cd.cachedRepresentation.cacheBuilder.clearCache(blocking)
+        } else {
+          val plan = spark.sessionState.executePlan(cd.plan).executedPlan
+          val newCache = InMemoryRelation(
--- End diff --
--- End diff --
Yes, you are right, although it wouldn't lead to any error, just like all
other compiled DataFrames that refer to this old InMemoryRelation. I'll change
this piece of code. But you've brought up another interesting question.
Consider a scenario similar to what you've mentioned:
```scala
df1 = ...
df2 = df1.filter(...)
df2.cache()
df1.cache()
df1.collect()
```
Here we cache the dependent Dataset (df2) first and the Dataset it
depends on (df1) next. Optimally, when you call df2.collect(), you would like
df2 to use the cached data in df1, but it doesn't work like this now, since
df2's execution plan has already been generated before we call df1.cache(). It
might be worth revisiting the caches and updating their plans if necessary when
we call cacheQuery().
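To make the ordering issue concrete, here is a toy sketch (all names here --
`ToyCacheManager`, `Plan`, `compile`, `InMemoryScan` -- are made up for
illustration and are not Spark's actual internals): a plan is "compiled" once,
and at compile time it can only substitute caches that already exist, which is
why caching df2 before df1 leaves df2's compiled plan blind to df1's cache.

```scala
// Toy model of cache substitution at plan-compile time.
// A Plan is just a list of operator names, source last.
object CacheOrderSketch {
  case class Plan(ops: List[String])

  class ToyCacheManager {
    private var cached = List.empty[Plan]

    def cache(plan: Plan): Unit = cached = plan :: cached

    // "Compile" a plan: if a suffix of its operators matches a cached plan,
    // replace that suffix with a marker standing in for an InMemoryRelation
    // scan. This happens once, when the plan is generated.
    def compile(plan: Plan): Plan =
      cached.find(c => plan.ops.endsWith(c.ops)) match {
        case Some(c) =>
          Plan(plan.ops.dropRight(c.ops.length) :+
            s"InMemoryScan(${c.ops.mkString(",")})")
        case None => plan
      }
  }

  def main(args: Array[String]): Unit = {
    val df1 = Plan(List("scan"))
    val df2 = Plan(List("filter", "scan"))

    // Order from the scenario above: df2's plan is generated before
    // df1.cache(), so it cannot reference df1's cache.
    val mgr = new ToyCacheManager
    val compiledDf2 = mgr.compile(df2)
    mgr.cache(df1)
    println(compiledDf2.ops) // List(filter, scan) -- df1's cache unused

    // Reverse order: cache df1 first, then compile df2.
    val mgr2 = new ToyCacheManager
    mgr2.cache(df1)
    println(mgr2.compile(df2).ops) // List(filter, InMemoryScan(scan))
  }
}
```

Revisiting cached entries inside cacheQuery() would amount to re-running the
substitution step above for already-compiled plans whenever a new cache is
registered.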
---