mridulm commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1101657140
##########
core/src/main/scala/org/apache/spark/storage/BlockManager.scala:
##########
@@ -1424,6 +1457,16 @@ private[spark] class BlockManager(
blockStoreUpdater.save()
}
+ // Check whether a rdd block is visible or not.
+ private[spark] def isRDDBlockVisible(blockId: RDDBlockId): Boolean = {
+ // If the rdd block visibility information not available in the block
manager,
+ // asking master for the information.
+ if (blockInfoManager.isRDDBlockVisible(blockId)) {
+ return true
+ }
+ master.isRDDBlockVisible(blockId)
Review Comment:
To clarify, (my previous comment was poorly worded).
A block would be invisible for three reasons primarily (and minor variants
of these) :
a) Some task T1 is still actively working on the partition which computed it.
b) Some task T1 on E1 generated the partition, which got pushed to E2 due to
replication - and is not yet visible, since T1 is still running.
c) Some task T1 on E1 generated B1, but failed - and so block exists on E1
but is invisible.
In these cases, a block which is invisible would become visible once the
task T1 successfully complete - or (in case of failure), is dropped and becomes
invisible.
So the transition is only from invisible -> visible or invisible -> dropped.
When a block transitions to becoming visible (due to an explicit message
from driver to the executor which generated it, and to the replicas), it
remains in that state until it is dropped explicitly (which is again another
message from driver).
So the proposal was:
a) Maintain cached state of block at executor - once it is visible, it never
changes from that.
b) If invisible, it could be a potential race (T1 completed, driver yet to
update B1 to E1, but launched some T2 which wants to read B1) - so we have to
query driver, but we can cache the result is visible.
b.1) We can minimize this impact by dropping all B1's when the task T1 fails
at E1 (driver would send the update anyway).
So essentially, we can cache only if the block is visible - if it is not
visible, we always have to query driver (with (b.1) to minimize needing to
query failures)
Thoughts ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]