tgravescs commented on a change in pull request #24374: [SPARK-27366][CORE]
Support GPU Resources in Spark job scheduling
URL: https://github.com/apache/spark/pull/24374#discussion_r289179092
##########
File path: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
##########
@@ -263,7 +272,7 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
     val workOffers = activeExecutors.map {
       case (id, executorData) =>
         new WorkerOffer(id, executorData.executorHost, executorData.freeCores,
-          Some(executorData.executorAddress.hostPort))
+          Some(executorData.executorAddress.hostPort),
           executorData.availableResources.toMap)
Review comment:
Oh sorry, you are right, I misread, and that is actually worse I think. We
don't want to allow the taskScheduler to modify a data structure owned by the
CoarseGrainedSchedulerBackend. At this point free cores are handled
separately: we create the WorkerOffers, which keep their own bookkeeping
separate from the executorDataMap in CoarseGrainedSchedulerBackend. The task
scheduler can do whatever it wants with the WorkerOffers, and then the
CoarseGrainedSchedulerBackend is responsible for updating the executorDataMap.
If they are kept totally separate, you don't have to worry about the
executorDataMap being wrong when the serializedTask is too large or something
else fails before launchTasks. If you keep them together, you have to be very
sure that anything failing after a reservation unreserves it, which I think is
more of a maintenance issue. If you miscount in one iteration of the task
scheduler loop dealing with WorkerOffers, that is mostly ok: the next
makeOffers call will fix it up and have the right data, since it is pulled
from the separate executorDataMap.
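
To make the separation concrete, here is a rough, self-contained sketch of the
flow I mean (`ExecutorDataStub`, `BackendSketch`, and the `resourceOffers` /
`launchTasks` signatures are simplified stand-ins just for illustration, not
the real classes):

```scala
// Sketch of the makeOffers/launchTasks separation, with stub types.
case class WorkerOffer(
    executorId: String,
    host: String,
    cores: Int,
    address: Option[String],
    resources: Map[String, Seq[String]])

class ExecutorDataStub(
    val executorHost: String,
    val freeCores: Int,
    val hostPort: String,
    val availableResources: scala.collection.mutable.Map[String, Seq[String]])

class BackendSketch(
    activeExecutors: Map[String, ExecutorDataStub],
    resourceOffers: IndexedSeq[WorkerOffer] => Seq[String], // stands in for the task scheduler
    launchTasks: Seq[String] => Unit) {                     // updates executorDataMap on success

  // Offers are snapshots: the scheduler mutates only its own bookkeeping,
  // and the backend's map is touched only in launchTasks, after the task is
  // successfully serialized. A miscount in one scheduling iteration is
  // self-healing because the next makeOffers rebuilds offers from the map.
  def makeOffers(): Unit = {
    val offers = activeExecutors.map { case (id, data) =>
      WorkerOffer(id, data.executorHost, data.freeCores,
        Some(data.hostPort),
        data.availableResources.toMap) // snapshot, not the live structure
    }.toIndexedSeq
    launchTasks(resourceOffers(offers))
  }
}
```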
So I think we should make a deep copy of availableResources to send into the
WorkerOffer. Normally I wouldn't expect much to be in there, so the copy
shouldn't be too expensive. But makeOffers does get called a lot when you have
a lot of executors, so it might be interesting to see what the overhead is
with thousands of executors. I'll try to look at it some more tomorrow.
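
For the copy itself, assuming availableResources is a mutable map from
resource name to a mutable buffer of addresses (the helper name below is
hypothetical, just for illustration), something like this, since `.toMap`
alone only copies the outer map:

```scala
import scala.collection.mutable

// Hypothetical helper: copy the values too, otherwise the task scheduler
// could still mutate the backend's bookkeeping through the shared buffers.
// toList always materializes an immutable copy of each buffer.
def snapshotResources(
    available: mutable.Map[String, mutable.Buffer[String]]): Map[String, Seq[String]] =
  available.map { case (name, addrs) => name -> addrs.toList }.toMap
```

That is one extra allocation per resource type per executor per scheduling
round, which is the overhead I'd want to measure at scale.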