[GitHub] [spark] mridulm commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

via GitHub Thu, 09 Feb 2023 07:38:06 -0800


mridulm commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1101657140



##########
core/src/main/scala/org/apache/spark/storage/BlockManager.scala:
##########
@@ -1424,6 +1457,16 @@ private[spark] class BlockManager(
     blockStoreUpdater.save()
   }
 
+  // Check whether a rdd block is visible or not.
+  private[spark] def isRDDBlockVisible(blockId: RDDBlockId): Boolean = {
+    // If the rdd block visibility information not available in the block 
manager,
+    // asking master for the information.
+    if (blockInfoManager.isRDDBlockVisible(blockId)) {
+      return true
+    }
+    master.isRDDBlockVisible(blockId)

Review Comment:
   To clarify, (my previous comment was poorly worded).
   
   A block would be invisible for three reasons primarily (and minor variants 
of these) :
   a) Some task T1 is still actively working on the partition which computed it.
   b) Some task T1 on E1 generated the partition, which got pushed to E2 due to 
replication - and is not yet visible, since T1 is still running.
   c) Some task T1 on E1 generated B1, but failed - and so block exists on E1 
but is invisible.
   
   
   In these cases, a block which is invisible would become visible once the 
task T1 successfully complete - or (in case of failure), is dropped and becomes 
invisible.
   So the transition is only from invisible -> visible or invisible -> dropped.
   When a block transitions to becoming visible (due to an explicit message 
from driver to the executor which generated it, and to the replicas), it 
remains in that state until it is dropped explicitly (which is again another 
message from driver).
   
   So the proposal was:
   a) Maintain cached state of block at executor - once it is visible, it never 
changes from that.
   b) If invisible, it could be a potential race (T1 completed, driver yet to 
update B1 to E1, but launched some T2 which wants to read B1) - so we have to 
query driver, but we can cache the result is visible.
   b.1) We can minimize this impact by dropping all B1's when the task T1 fails 
at E1 (driver would send the update anyway).
   
   
   So essentially, we can cache only if the block is visible - if it is not 
visible, we always have to query driver (with (b.1) to minimize needing to 
query failures)
   
   Thoughts ?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] mridulm commented on a diff in pull request #39459: [SPARK-41497][CORE] Fixing accumulator undercount in the case of the retry task with rdd cache

Reply via email to