Victsm commented on a change in pull request #30164:
URL: https://github.com/apache/spark/pull/30164#discussion_r525835969
##########
File path: core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala
##########
@@ -92,4 +93,16 @@ private[spark] trait SchedulerBackend {
*/
def maxNumConcurrentTasks(rp: ResourceProfile): Int
+ /**
+ * Get the list of host locations for push based shuffle
+ *
+ * Currently push based shuffle is disabled for both stage retry and stage reuse cases
Review comment:
Both are true.
`getShufflePushMergerLocations` will be invoked only once per
`ShuffleDependency`.
Thus retried stages will get the same merger locations.
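To make that concrete, here is a minimal, self-contained Scala sketch of the idea (not the actual Spark API; `MergerLocation`, `ShuffleDep`, and `MergerLocationCache` are hypothetical names for illustration): the lookup runs only once per shuffle dependency, so a retried stage sees exactly the same merger locations as the first attempt.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for the real Spark types; names are illustrative only.
case class MergerLocation(host: String, port: Int)
case class ShuffleDep(shuffleId: Int, numPartitions: Int)

// Memoizes merger locations per shuffle dependency, so a retried stage asking
// for the same ShuffleDep gets exactly the locations chosen by the first attempt.
class MergerLocationCache(lookup: ShuffleDep => Seq[MergerLocation]) {
  private val cache = mutable.Map.empty[Int, Seq[MergerLocation]]

  def getOrCompute(dep: ShuffleDep): Seq[MergerLocation] = synchronized {
    // `lookup` runs only on the first call for a given shuffleId; later calls
    // (including from a retried stage) return the cached list unchanged.
    cache.getOrElseUpdate(dep.shuffleId, lookup(dep))
  }
}

object MergerLocationCacheDemo extends App {
  var lookups = 0
  val cache = new MergerLocationCache(dep => {
    lookups += 1
    Seq(MergerLocation(s"host-${dep.shuffleId}-a", 7337),
        MergerLocation(s"host-${dep.shuffleId}-b", 7337))
  })

  val dep = ShuffleDep(shuffleId = 0, numPartitions = 4)
  val firstAttempt = cache.getOrCompute(dep) // original stage attempt
  val retryAttempt = cache.getOrCompute(dep) // retried stage: no new lookup
  assert(firstAttempt == retryAttempt && lookups == 1)
  println(s"merger locations reused across retry: $retryAttempt")
}
```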
In #30062, the block push handling logic we implemented ignores blocks received after shuffle finalization:
https://github.com/apache/spark/blob/dd32f45d2058d00293330c01d3d9f53ecdbc036c/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RemoteBlockPushResolver.java#L132
So, blocks pushed from the retried stage will be ignored, which is what this comment means by `push based shuffle is disabled for both stage retry and stage reuse cases`.
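As a rough illustration of that check, here is a toy Scala sketch (hypothetical names such as `PushResolverSketch`; the real logic is in the Java `RemoteBlockPushResolver` linked above and is considerably more involved): once a shuffle's merge has been finalized, any block push that arrives afterwards is dropped instead of merged.

```scala
import scala.collection.mutable

// Toy model of the finalization check: pushes that arrive after the shuffle
// merge has been finalized are ignored. Hypothetical names, not Spark's code.
class PushResolverSketch {
  private val finalizedShuffles = mutable.Set.empty[Int]
  private val mergedBlocks = mutable.Map.empty[Int, mutable.ListBuffer[String]]

  // Returns true if the block was accepted for merging, false if it was ignored.
  def receiveBlockPush(shuffleId: Int, blockId: String): Boolean = synchronized {
    if (finalizedShuffles.contains(shuffleId)) {
      // Pushes arriving after finalization (e.g. from a retried stage's tasks)
      // are dropped; their data was most likely merged already.
      false
    } else {
      mergedBlocks.getOrElseUpdate(shuffleId, mutable.ListBuffer.empty[String]) += blockId
      true
    }
  }

  // Called when the shuffle merge for this shuffleId is finalized.
  def finalizeShuffleMerge(shuffleId: Int): Seq[String] = synchronized {
    finalizedShuffles += shuffleId
    mergedBlocks.getOrElse(shuffleId, mutable.ListBuffer.empty[String]).toList
  }
}

object PushResolverSketchDemo extends App {
  val resolver = new PushResolverSketch
  resolver.receiveBlockPush(0, "shufflePush_0_0_0")   // accepted before finalization
  resolver.finalizeShuffleMerge(0)                    // merge finalized
  val accepted = resolver.receiveBlockPush(0, "shufflePush_0_1_0") // push from a retried stage
  println(s"push after finalization accepted? $accepted")          // false
}
```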
Ignoring blocks pushed from the retried stage is reasonable, since the block data from these retried tasks has most likely already been merged.
Making sure the retried stage uses the same merger locations is critical to ensure we don't run into data duplication issues.
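To illustrate the duplication risk with a toy sketch (again hypothetical names, not Spark's actual merge logic): a merger can only deduplicate map outputs it has seen itself, so sending a retried stage's pushes to a different merger would merge the same records twice, while reusing the same merger lets the duplicate push be dropped.

```scala
import scala.collection.mutable

// Toy merger: appends pushed map outputs to its merged data, deduplicating by
// mapIndex locally. It cannot know about outputs merged by a different merger.
class TinyMerger {
  private val seenMapIndexes = mutable.Set.empty[Int]
  private val mergedRecords = mutable.ListBuffer.empty[String]

  def push(mapIndex: Int, records: Seq[String]): Unit = {
    // Local dedup: ignore a map output this merger has already merged.
    if (seenMapIndexes.add(mapIndex)) mergedRecords ++= records
  }

  def merged: Seq[String] = mergedRecords.toList
}

object MergerReuseDemo extends App {
  val records = Seq("a", "b")

  // Retried stage reuses the same merger: the duplicate push is dropped.
  val sameMerger = new TinyMerger
  sameMerger.push(0, records) // original attempt
  sameMerger.push(0, records) // retried attempt, deduped locally
  println(s"same merger:      ${sameMerger.merged}") // List(a, b)

  // Retried stage pushes to a different merger: a reducer reading both merged
  // outputs would see every record twice.
  val mergerA = new TinyMerger
  val mergerB = new TinyMerger
  mergerA.push(0, records) // original attempt
  mergerB.push(0, records) // retried attempt on another merger
  println(s"different merger: ${mergerA.merged ++ mergerB.merged}") // List(a, b, a, b)
}
```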
The only exception is indeterminate stage retry, for which we have created SPARK-32923.