mridulm commented on code in PR #39459:
URL: https://github.com/apache/spark/pull/39459#discussion_r1106753599
##########
core/src/main/scala/org/apache/spark/storage/BlockManager.scala:
##########
@@ -1424,6 +1457,16 @@ private[spark] class BlockManager(
blockStoreUpdater.save()
}
+ // Check whether a rdd block is visible or not.
+ private[spark] def isRDDBlockVisible(blockId: RDDBlockId): Boolean = {
+ // If the rdd block visibility information is not available in the block manager,
+ // ask the master for the information.
+ if (blockInfoManager.isRDDBlockVisible(blockId)) {
+ return true
+ }
+ master.isRDDBlockVisible(blockId)
Review Comment:
> With the above mechanism, do you think we still need another cache to store
> the visibility information in the executor, or do we also need to cache the
> state in executors not having the cached block data stored?
You are right, the current PR already handles it on a second read, since we
check `blockInfoManager.isRDDBlockVisible(blockId)` first.
This should cover case (1); we will always query in case the block
is available and we have to distinguish (2).
(2.1) would be an optimization we can attempt later on.
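
To make the lookup order concrete, here is a minimal, self-contained sketch of the "check locally first, then ask the master" pattern from the diff. Everything in it is a hypothetical stand-in for illustration (`VisibilitySource`, `LocalVisibility`, `CountingMaster`, `VisibilityLookup`, and the simplified `RDDBlockId`); none of it is Spark's actual `BlockManager`, `BlockInfoManager`, or `BlockManagerMaster` code, and it does not model any caching of the master's answer.

```scala
// Hypothetical stand-ins for illustration only; these are not Spark's classes.
final case class RDDBlockId(rddId: Int, splitIndex: Int)

trait VisibilitySource {
  def isRDDBlockVisible(blockId: RDDBlockId): Boolean
}

// Executor-side view: knows only the blocks it has already seen marked visible.
final class LocalVisibility(visible: Set[RDDBlockId]) extends VisibilitySource {
  def isRDDBlockVisible(blockId: RDDBlockId): Boolean = visible.contains(blockId)
}

// Stand-in for the master; counts how often it is queried.
final class CountingMaster(visible: Set[RDDBlockId]) extends VisibilitySource {
  var queries: Int = 0
  def isRDDBlockVisible(blockId: RDDBlockId): Boolean = {
    queries += 1
    visible.contains(blockId)
  }
}

final class VisibilityLookup(local: VisibilitySource, master: VisibilitySource) {
  // `||` short-circuits, so the master is asked only when the local check is false.
  def isRDDBlockVisible(blockId: RDDBlockId): Boolean =
    local.isRDDBlockVisible(blockId) || master.isRDDBlockVisible(blockId)
}
```

With this shape, a block that is already known to be visible locally never triggers a master query, while any other block costs one query per lookup.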