Victsm commented on a change in pull request #30164:
URL: https://github.com/apache/spark/pull/30164#discussion_r525835969
##########
File path: core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala
##########
@@ -92,4 +93,16 @@ private[spark] trait SchedulerBackend {
*/
def maxNumConcurrentTasks(rp: ResourceProfile): Int
+ /**
+ * Get the list of host locations for push based shuffle
+ *
+ * Currently push based shuffle is disabled for both stage retry and stage reuse cases
Review comment:
Both are true.
`getShufflePushMergerLocations` will be invoked only once per
`ShuffleDependency`.
Thus retried stages will get the same merger locations.
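To make that concrete, here is a minimal, self-contained Scala sketch of the idea (not the actual Spark API; `MergerLocation`, `ShuffleDep`, and `MergerLocationCache` are hypothetical names for illustration): the lookup runs only once per shuffle dependency, so a retried stage sees exactly the same merger locations as the first attempt.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for the real Spark types; names are illustrative only.
case class MergerLocation(host: String, port: Int)
case class ShuffleDep(shuffleId: Int, numPartitions: Int)

// Memoizes merger locations per shuffle dependency, so a retried stage asking
// for the same ShuffleDep gets exactly the locations chosen by the first attempt.
class MergerLocationCache(lookup: ShuffleDep => Seq[MergerLocation]) {
  private val cache = mutable.Map.empty[Int, Seq[MergerLocation]]

  def getOrCompute(dep: ShuffleDep): Seq[MergerLocation] = synchronized {
    // `lookup` runs only on the first call for a given shuffleId; later calls
    // (including from a retried stage) return the cached list unchanged.
    cache.getOrElseUpdate(dep.shuffleId, lookup(dep))
  }
}

object MergerLocationCacheDemo extends App {
  var lookups = 0
  val cache = new MergerLocationCache(dep => {
    lookups += 1
    Seq(MergerLocation(s"host-${dep.shuffleId}-a", 7337),
        MergerLocation(s"host-${dep.shuffleId}-b", 7337))
  })

  val dep = ShuffleDep(shuffleId = 0, numPartitions = 4)
  val firstAttempt = cache.getOrCompute(dep) // original stage attempt
  val retryAttempt = cache.getOrCompute(dep) // retried stage: no new lookup
  assert(firstAttempt == retryAttempt && lookups == 1)
  println(s"merger locations reused across retry: $retryAttempt")
}
```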
In #30062, the block push handling logic we implemented ignores blocks received after shuffle finalization:
https://github.com/apache/spark/blob/dd32f45d2058d00293330c01d3d9f53ecdbc036c/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java#L132
So, blocks pushed from the retried stage will be ignored, which is what this comment means by `push based shuffle is disabled for both stage retry and stage reuse cases`.
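As a rough illustration of that check, here is a toy Scala sketch (hypothetical names such as `PushResolverSketch`; the real logic is in the Java `RemoteBlockPushResolver` linked above and is considerably more involved): once a shuffle's merge has been finalized, any block push that arrives afterwards is dropped instead of merged.

```scala
import scala.collection.mutable

// Toy model of the finalization check: pushes that arrive after the shuffle
// merge has been finalized are ignored. Hypothetical names, not Spark's code.
class PushResolverSketch {
  private val finalizedShuffles = mutable.Set.empty[Int]
  private val mergedBlocks = mutable.Map.empty[Int, mutable.ListBuffer[String]]

  // Returns true if the block was accepted for merging, false if it was ignored.
  def receiveBlockPush(shuffleId: Int, blockId: String): Boolean = synchronized {
    if (finalizedShuffles.contains(shuffleId)) {
      // Pushes arriving after finalization (e.g. from a retried stage's tasks)
      // are dropped; their data was most likely merged already.
      false
    } else {
      mergedBlocks.getOrElseUpdate(shuffleId, mutable.ListBuffer.empty[String]) += blockId
      true
    }
  }

  // Called when the shuffle merge for this shuffleId is finalized.
  def finalizeShuffleMerge(shuffleId: Int): Seq[String] = synchronized {
    finalizedShuffles += shuffleId
    mergedBlocks.getOrElse(shuffleId, mutable.ListBuffer.empty[String]).toList
  }
}

object PushResolverSketchDemo extends App {
  val resolver = new PushResolverSketch
  resolver.receiveBlockPush(0, "shufflePush_0_0_0")   // accepted before finalization
  resolver.finalizeShuffleMerge(0)                    // merge finalized
  val accepted = resolver.receiveBlockPush(0, "shufflePush_0_1_0") // push from a retried stage
  println(s"push after finalization accepted? $accepted")          // false
}
```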
Ignoring blocks pushed from the retried stage is reasonable, since the block data from these retried tasks has most likely already been merged.
Making sure the retried stage uses the same merger locations is critical to ensure we don't run into data duplication issues.
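To illustrate the duplication risk with a toy sketch (again hypothetical names, not Spark's actual merge logic): a merger can only deduplicate map outputs it has seen itself, so sending a retried stage's pushes to a different merger would merge the same records twice, while reusing the same merger lets the duplicate push be dropped.

```scala
import scala.collection.mutable

// Toy merger: appends pushed map outputs to its merged data, deduplicating by
// mapIndex locally. It cannot know about outputs merged by a different merger.
class TinyMerger {
  private val seenMapIndexes = mutable.Set.empty[Int]
  private val mergedRecords = mutable.ListBuffer.empty[String]

  def push(mapIndex: Int, records: Seq[String]): Unit = {
    // Local dedup: ignore a map output this merger has already merged.
    if (seenMapIndexes.add(mapIndex)) mergedRecords ++= records
  }

  def merged: Seq[String] = mergedRecords.toList
}

object MergerReuseDemo extends App {
  val records = Seq("a", "b")

  // Retried stage reuses the same merger: the duplicate push is dropped.
  val sameMerger = new TinyMerger
  sameMerger.push(0, records) // original attempt
  sameMerger.push(0, records) // retried attempt, deduped locally
  println(s"same merger:      ${sameMerger.merged}") // List(a, b)

  // Retried stage pushes to a different merger: a reducer reading both merged
  // outputs would see every record twice.
  val mergerA = new TinyMerger
  val mergerB = new TinyMerger
  mergerA.push(0, records) // original attempt
  mergerB.push(0, records) // retried attempt on another merger
  println(s"different merger: ${mergerA.merged ++ mergerB.merged}") // List(a, b, a, b)
}
```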
The only exception is indeterminate stage retry, for which we have created SPARK-32923.