Github user squito commented on a diff in the pull request:
https://github.com/apache/spark/pull/19041#discussion_r179266769
--- Diff:
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala
---
@@ -252,6 +257,44 @@ class BlockManagerMasterEndpoint(
blockManagerIdByExecutor.get(execId).foreach(removeBlockManager)
}
    +  private def recoverLatestRDDBlock(
    +      execId: String,
    +      excludeExecutors: Seq[String],
    +      context: RpcCallContext): Unit = {
    +    logDebug(s"Replicating first cached block on $execId")
    +    val excluded = excludeExecutors.flatMap(blockManagerIdByExecutor.get)
    +    val response: Option[Future[Boolean]] = for {
    +      blockManagerId <- blockManagerIdByExecutor.get(execId)
    +      info <- blockManagerInfo.get(blockManagerId)
    +      blocks = info.cachedBlocks.collect { case r: RDDBlockId => r }
    +      // As a heuristic, prioritize replicating the latest rdd. If this succeeds,
    +      // CacheRecoveryManager will try to replicate the remaining rdds.
    +      firstBlock <- if (blocks.isEmpty) None else Some(blocks.maxBy(_.rddId))
    +      replicaSet <- blockLocations.asScala.get(firstBlock)
    +      // Add 2 to force this block to be replicated to one new executor.
    +      maxReps = replicaSet.size + 2
--- End diff ---
I figured out why you need +2 instead of +1. The existing code expects you to
explicitly *remove* the replicating block manager's own id from `replicaSet`
before calling replicate. See:
https://github.com/apache/spark/blob/cccaaa14ad775fb981e501452ba2cc06ff5c0f0a/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L236-L239
While the existing code is confusing, I definitely don't like using +2 here as
a workaround, since it just compounds the confusion. I'd at least update the
comments on `BlockManager.replicate()` etc., or maybe just change its behavior
and update the callsites.
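For what it's worth, here's a minimal, self-contained sketch of the arithmetic. This is not Spark code: `ReplicationMath` and `newPeers` are my own simplified model of `replicate()`'s contract (new copies = maxReplicas, minus existing replicas passed in, minus one for the source itself), but it shows why stripping the source id lets you use +1 while keeping it in `replicaSet` forces the +2 workaround:

```scala
// Simplified model (assumption, not Spark's actual implementation) of how
// many NEW executors receive a copy given the caller's arguments.
object ReplicationMath {
  // maxReplicas counts the source itself, so subtract 1 for it, plus one
  // slot for each existing replica the caller reports.
  def newPeers(existingReplicas: Set[String], maxReplicas: Int): Int =
    math.max(0, maxReplicas - existingReplicas.size - 1)

  def main(args: Array[String]): Unit = {
    // The block currently lives only on the executor we're draining.
    val replicaSet = Set("exec-1")
    val source = "exec-1"

    // Existing convention: remove the source id first, then +1 suffices.
    assert(newPeers(replicaSet - source, replicaSet.size + 1) == 1)

    // The workaround in this PR: keep the source id, compensate with +2.
    assert(newPeers(replicaSet, replicaSet.size + 2) == 1)

    println("both conventions yield exactly one new replica")
  }
}
```

Either convention gives one new replica, which is why the +2 "works", but only the first matches what `replicate()`'s callers are supposed to do today.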
---