agrawaldevesh commented on a change in pull request #28370:
URL: https://github.com/apache/spark/pull/28370#discussion_r438959909
##########
File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
##########
@@ -1551,30 +1555,36 @@ private[spark] class BlockManager(
}
/**
- * Called for pro-active replenishment of blocks lost due to executor failures
+ * Replicates a block to peer block managers based on existingReplicas and maxReplicas
*
* @param blockId blockId being replicate
* @param existingReplicas existing block managers that have a replica
* @param maxReplicas maximum replicas needed
+ * @param maxReplicationFailures number of replication failures to tolerate before
+ *                               giving up.
+ * @return whether block was successfully replicated or not
*/
def replicateBlock(
Review comment:
@prakharjain09 / @holdenk, thank you for this improvement. I had a question:
(I am still new to these code paths and not totally sure of what I am talking
about, so if I am missing something, please help me fill in the gaps :-).)
I notice that `replicateBlock` is already called during the executor removal
codepath: `org.apache.spark.storage.BlockManagerMasterEndpoint#removeBlockManager`
replicates to other peers when `spark.storage.replication.proactive=true`. This
seems to have been implemented in
[SPARK-15355](https://issues.apache.org/jira/browse/SPARK-15355), and
`removeBlockManager` is only triggered when the executor is "eventually" lost.
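For my own understanding, here is a minimal, self-contained sketch of what I
believe that proactive-replication trigger amounts to. All names below (`Peer`,
`BlockInfo`, `onBlockManagerRemoved`) are illustrative stand-ins, not the real
`BlockManagerMasterEndpoint` code:

```scala
// Illustrative sketch only: when a block manager is removed, every block that
// just lost a replica (and is now under its desired replica count) gets a
// replication request sent to one of its surviving holders.
case class Peer(id: String)
case class BlockInfo(blockId: String, replicas: Set[Peer], maxReplicas: Int)

object ProactiveReplicationSketch {
  // Returns (blockId, surviving replicas, maxReplicas) tuples describing the
  // replication requests that would be issued when `lost` is removed.
  def onBlockManagerRemoved(
      proactiveEnabled: Boolean,
      lost: Peer,
      blocks: Seq[BlockInfo]): Seq[(String, Set[Peer], Int)] = {
    if (!proactiveEnabled) return Seq.empty
    blocks
      .filter(_.replicas.contains(lost))              // blocks that just lost a replica
      .map(b => (b.blockId, b.replicas - lost, b.maxReplicas))
      .filter { case (_, remaining, max) =>
        remaining.nonEmpty && remaining.size < max    // still under-replicated, recoverable
      }
  }
}
```

The key point for this discussion is that none of this runs until the removal
event fires.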
I understand it would be too late to start the replication once the executor is
actually lost: decommissioning as implemented in
https://github.com/apache/spark/pull/26440 does not eagerly trigger executor
loss. Instead, we merely stop scheduling on the decommissioned executor and let
it be shot down out of band, which means the replication triggered by
SPARK-15355 would come too late.
I like the approach taken in this PR of eagerly telling the executor
(block manager) to start replication when decommissioning is first initiated,
giving it more time to be useful. But I wonder whether you considered
implementing this somewhat differently, by leveraging the
[existing "eager replication" loop](https://github.com/apache/spark/pull/14412/files#diff-186864190089a718680accb51de5f0d4R215).
Thanks!
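To make sure I understand the PR's "eager" approach, here is a rough,
hypothetical sketch of offloading all blocks at decommission start while the
executor is still alive, with a bounded number of tolerated failures per block
(mirroring the `maxReplicationFailures` parameter this PR adds; the helper
names are mine, not the PR's):

```scala
import scala.collection.mutable

// Illustrative sketch: iterate over this executor's blocks as soon as
// decommissioning begins, retrying each replication up to
// maxReplicationFailures times before giving up on that block.
object DecomOffloadSketch {
  // `replicate` stands in for a call like replicateBlock; returns true on success.
  def offloadAll(
      blockIds: Seq[String],
      maxReplicationFailures: Int,
      replicate: String => Boolean): (Seq[String], Seq[String]) = {
    val done = mutable.Buffer[String]()
    val failed = mutable.Buffer[String]()
    blockIds.foreach { id =>
      var failures = 0
      var ok = replicate(id)
      while (!ok && failures < maxReplicationFailures) {
        failures += 1                 // tolerate up to maxReplicationFailures failures
        ok = replicate(id)
      }
      if (ok) done += id else failed += id
    }
    (done.toSeq, failed.toSeq)
  }
}
```

My question is essentially whether this per-executor loop could instead reuse
the replication machinery that SPARK-15355 already wired up.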
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]