mridulm commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1106755256


##########
core/src/main/scala/org/apache/spark/storage/BlockManager.scala:
##########
@@ -1424,6 +1457,16 @@ private[spark] class BlockManager(
     blockStoreUpdater.save()
   }
 
+  // Check whether an RDD block is visible or not.
+  private[spark] def isRDDBlockVisible(blockId: RDDBlockId): Boolean = {
+    // If the RDD block visibility information is not available in the block
+    // manager, ask the master for it.
+    if (blockInfoManager.isRDDBlockVisible(blockId)) {
+      return true
+    }
+    master.isRDDBlockVisible(blockId)

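The diff above follows a two-tier lookup pattern: consult the local `BlockInfoManager` first, and only fall back to an RPC to the master on a miss. A minimal, self-contained sketch of that pattern (not the actual Spark implementation; `LocalVisibility` and `MasterStub` are hypothetical stand-ins for `BlockInfoManager` and `BlockManagerMaster`):

```scala
// Sketch of the local-first visibility check assumed in the diff above.
// All class names here are hypothetical; only the control flow mirrors
// the PR's isRDDBlockVisible.
object VisibilitySketch {
  final case class RDDBlockId(rddId: Int, splitIndex: Int)

  // Hypothetical local cache of blocks already known to be visible.
  class LocalVisibility(visible: Set[RDDBlockId]) {
    def isRDDBlockVisible(id: RDDBlockId): Boolean = visible.contains(id)
  }

  // Hypothetical stand-in for the BlockManagerMaster RPC endpoint.
  class MasterStub(visible: Set[RDDBlockId]) {
    def isRDDBlockVisible(id: RDDBlockId): Boolean = visible.contains(id)
  }

  def isRDDBlockVisible(local: LocalVisibility,
                        master: MasterStub,
                        id: RDDBlockId): Boolean = {
    // Fast path: answer from local state, no RPC round trip.
    if (local.isRDDBlockVisible(id)) {
      return true
    }
    // Slow path: one round trip to the master.
    master.isRDDBlockVisible(id)
  }

  def main(args: Array[String]): Unit = {
    val b1 = RDDBlockId(0, 0)
    val b2 = RDDBlockId(0, 1)
    val local = new LocalVisibility(Set(b1))
    val master = new MasterStub(Set(b1, b2))
    println(isRDDBlockVisible(local, master, b1)) // resolved locally
    println(isRDDBlockVisible(local, master, b2)) // falls back to the master
  }
}
```

The design choice under discussion is exactly the slow path: whether a reader should have to wait (or ask the master) for visibility when the producing task has not yet completed.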
Review Comment:
   > I'm not trying to talk about the cache stuff, but just to highlight this: this would be a behavior change, right? If T1 generates B1 early on and T1 turns out to be a long-running task, it can be terrible for tasks like T2 which read B1.
   
   Do we actually have use cases where some other stage/task depends on a block generated by an earlier task as a prerequisite for its own computation, while that task itself has not completed?
   
   That looks like a fairly brittle assumption, no? (Unless I have misunderstood the use case here!)
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

