agrawaldevesh commented on a change in pull request #28370:
URL: https://github.com/apache/spark/pull/28370#discussion_r438959909
##########
File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
##########
@@ -1551,30 +1555,36 @@ private[spark] class BlockManager(
}
/**
- * Called for pro-active replenishment of blocks lost due to executor failures
+ * Replicates a block to peer block managers based on existingReplicas and maxReplicas
*
* @param blockId blockId being replicate
* @param existingReplicas existing block managers that have a replica
* @param maxReplicas maximum replicas needed
+ * @param maxReplicationFailures number of replication failures to tolerate before
+ *                               giving up.
+ * @return whether block was successfully replicated or not
*/
def replicateBlock(
Review comment:
@prakharjain09 / @holdenk, thank you for this improvement. I had a question:
(I am still new to these code paths and not totally sure of what I am talking
about, so if I am missing something, please help me fill in the gaps :-).)
I notice that `replicateBlock` is already called during the executor removal
codepath: `org.apache.spark.storage.BlockManagerMasterEndpoint#removeBlockManager`
replicates to other peers when `spark.storage.replication.proactive=true`. This
seems to have been implemented in
[SPARK-15355](https://issues.apache.org/jira/browse/SPARK-15355), and
`removeBlockManager` is only triggered when the executor is "eventually" lost.
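For my own understanding, here is a minimal, self-contained sketch of what I
believe that proactive-replication trigger amounts to. All names below (`Peer`,
`BlockInfo`, `onBlockManagerRemoved`) are illustrative stand-ins, not the real
`BlockManagerMasterEndpoint` code:

```scala
// Illustrative sketch only: when a block manager is removed, every block that
// just lost a replica (and is now under its desired replica count) gets a
// replication request sent to one of its surviving holders.
case class Peer(id: String)
case class BlockInfo(blockId: String, replicas: Set[Peer], maxReplicas: Int)

object ProactiveReplicationSketch {
  // Returns (blockId, surviving replicas, maxReplicas) tuples describing the
  // replication requests that would be issued when `lost` is removed.
  def onBlockManagerRemoved(
      proactiveEnabled: Boolean,
      lost: Peer,
      blocks: Seq[BlockInfo]): Seq[(String, Set[Peer], Int)] = {
    if (!proactiveEnabled) return Seq.empty
    blocks
      .filter(_.replicas.contains(lost))              // blocks that just lost a replica
      .map(b => (b.blockId, b.replicas - lost, b.maxReplicas))
      .filter { case (_, remaining, max) =>
        remaining.nonEmpty && remaining.size < max    // still under-replicated, recoverable
      }
  }
}
```

The key point for this discussion is that none of this runs until the removal
event fires.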
I understand it would be too late to start the replication once the executor is
actually lost: decommissioning as implemented in
https://github.com/apache/spark/pull/26440 does not eagerly trigger executor
loss. Instead, we merely stop scheduling on the decommissioned executor and let
it be shot down out of band, which means the replication triggered by
SPARK-15355 would come too late.
I like the approach taken in this PR of eagerly telling the executor
(block manager) to start replication when decommissioning is first initiated,
giving it more time to be useful. But I wonder whether you considered
implementing this somewhat differently, by leveraging the
[existing "eager replication" loop](https://github.com/apache/spark/pull/14412/files#diff-186864190089a718680accb51de5f0d4R215).
Thanks!
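To make sure I understand the PR's "eager" approach, here is a rough,
hypothetical sketch of offloading all blocks at decommission start while the
executor is still alive, with a bounded number of tolerated failures per block
(mirroring the `maxReplicationFailures` parameter this PR adds; the helper
names are mine, not the PR's):

```scala
import scala.collection.mutable

// Illustrative sketch: iterate over this executor's blocks as soon as
// decommissioning begins, retrying each replication up to
// maxReplicationFailures times before giving up on that block.
object DecomOffloadSketch {
  // `replicate` stands in for a call like replicateBlock; returns true on success.
  def offloadAll(
      blockIds: Seq[String],
      maxReplicationFailures: Int,
      replicate: String => Boolean): (Seq[String], Seq[String]) = {
    val done = mutable.Buffer[String]()
    val failed = mutable.Buffer[String]()
    blockIds.foreach { id =>
      var failures = 0
      var ok = replicate(id)
      while (!ok && failures < maxReplicationFailures) {
        failures += 1                 // tolerate up to maxReplicationFailures failures
        ok = replicate(id)
      }
      if (ok) done += id else failed += id
    }
    (done.toSeq, failed.toSeq)
  }
}
```

My question is essentially whether this per-executor loop could instead reuse
the replication machinery that SPARK-15355 already wired up.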
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]