vanzin commented on a change in pull request #24817: [SPARK-27963][core] Allow
dynamic allocation without a shuffle service.
URL: https://github.com/apache/spark/pull/24817#discussion_r298378800
##########
File path:
core/src/main/scala/org/apache/spark/scheduler/dynalloc/ExecutorMonitor.scala
##########
@@ -137,6 +254,21 @@ private[spark] class ExecutorMonitor(
val executorId = event.taskInfo.executorId
val exec = executors.get(executorId)
if (exec != null) {
+      // If the task succeeded and the stage generates shuffle data, record that this executor
+      // holds data for the shuffle. Note that this ignores speculation, since this code is not
+      // directly tied to the map output tracker that knows exactly which shuffle blocks are
Review comment:
> if there is a speculative and non-speculative task, and both succeed, you
say both have shuffle data
Right.
> scheduler only sends one location when it sends out the next downstream
tasks
IIRC the map output tracker only keeps one location for each shuffle block.
```
def addMapOutput(mapId: Int, status: MapStatus): Unit = synchronized {
  if (mapStatuses(mapId) == null) {
    _numAvailableOutputs += 1
    invalidateSerializedMapOutputStatusCache()
  }
  mapStatuses(mapId) = status
}
```
BTW that code is interesting: since the serialized cache is only invalidated when
the slot was previously empty, it seems possible to have the tracker pointing at
one executor while the serialized version is still pointing at another... so I
guess, inadvertently, the behavior the comment is talking about is "the right
thing".
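To make that observation concrete, here is a minimal, self-contained sketch of
the behavior quoted above. `ShuffleStatusSketch`, its nested classes, and the
one-field `MapStatus` stand-in are all hypothetical simplifications for
illustration, not the real Spark classes: the point is that the serialized
snapshot is only dropped when a slot goes from empty to filled, so a second
(e.g. speculative) success overwrites the tracker's entry without invalidating
the cached serialized copy.

```scala
object ShuffleStatusSketch {
  // Hypothetical stand-in for Spark's MapStatus: just the executor id.
  final case class MapStatus(execId: String)

  class ShuffleStatus(numMaps: Int) {
    private val mapStatuses = new Array[MapStatus](numMaps)
    // Lazily built snapshot, mimicking the serialized map output statuses.
    private var serializedSnapshot: Option[Seq[MapStatus]] = None

    def addMapOutput(mapId: Int, status: MapStatus): Unit = synchronized {
      if (mapStatuses(mapId) == null) {
        // First output for this map: drop any cached serialized form.
        serializedSnapshot = None
      }
      // Last writer wins; overwriting does NOT invalidate the snapshot.
      mapStatuses(mapId) = status
    }

    def serialized: Seq[MapStatus] = synchronized {
      if (serializedSnapshot.isEmpty) {
        serializedSnapshot = Some(mapStatuses.toSeq)
      }
      serializedSnapshot.get
    }
  }

  def main(args: Array[String]): Unit = {
    val s = new ShuffleStatus(1)
    s.addMapOutput(0, MapStatus("exec-1"))
    val first = s.serialized               // snapshot now records exec-1
    s.addMapOutput(0, MapStatus("exec-2")) // overwrite: cache kept as-is
    // The tracker holds exec-2, but the serialized view still says exec-1.
    println(s.serialized.head.execId)
  }
}
```

Under this model, if the executor monitor says both executors hold shuffle
data, it covers whichever of the two the (possibly stale) serialized statuses
end up pointing at.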
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services