DaveDeCaprio commented on a change in pull request #24028: [SPARK-26917][SQL] Further reduce locks in CacheManager
URL: https://github.com/apache/spark/pull/24028#discussion_r264046070
 
 

 ##########
 File path: 
sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala
 ##########
 @@ -144,16 +144,10 @@ class CacheManager extends Logging {
       } else {
         _.sameResult(plan)
       }
-    val plansToUncache = mutable.Buffer[CachedData]()
-    readLock {
-      val it = cachedData.iterator()
-      while (it.hasNext) {
-        val cd = it.next()
-        if (shouldRemove(cd.plan)) {
-          plansToUncache += cd
-        }
-      }
+    val cachedDataCopy = readLock {
+      cachedData.asScala.clone()
     }
+    val plansToUncache = cachedDataCopy.filter(cd => shouldRemove(cd.plan))
 
 Review comment:
   Yes, the problem is that the "shouldRemove" function is passed into this call. If that call is expensive, the lock is held for an arbitrarily long time.
   
   "shouldRemove", when called from "recacheByPath", traverses the entire logical plan tree for every cached plan, and in the process regenerates path strings for every file referenced by every single plan. In our situation at least, that is easily many orders of magnitude more overhead than a shallow copy of the list.
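   The pattern being argued for can be sketched as follows (a minimal standalone illustration, not the actual CacheManager code; the buffer contents, the `readLock` helper, and the `shouldRemove` stand-in are all hypothetical): take a shallow snapshot of the collection while holding the read lock, then run the potentially expensive predicate with no lock held.

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock
import scala.collection.mutable

object LockScopeSketch {
  private val lock = new ReentrantReadWriteLock()
  // Stand-in for the cached plans; in CacheManager this holds CachedData.
  private val cached = mutable.Buffer("plan-a", "plan-b", "plan-c")

  // Runs f while holding the read lock, mirroring CacheManager's readLock helper.
  private def readLock[T](f: => T): T = {
    val l = lock.readLock()
    l.lock()
    try f finally l.unlock()
  }

  // Hypothetical expensive predicate standing in for shouldRemove;
  // the real one may traverse an entire logical plan tree.
  def shouldRemove(plan: String): Boolean = plan.endsWith("b")

  def plansToUncache(): List[String] = {
    // Shallow copy under the lock: only O(n) reference copies happen here.
    val snapshot = readLock { cached.toList }
    // The expensive predicate runs on the snapshot with no lock held,
    // so slow plan traversals no longer block writers.
    snapshot.filter(shouldRemove)
  }

  def main(args: Array[String]): Unit = {
    println(plansToUncache().mkString(","))
  }
}
```

   The trade-off is that the snapshot may be slightly stale by the time the filter runs, which is acceptable here because uncaching already tolerates concurrent modification.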

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
