Github user mgummelt commented on a diff in the pull request:
https://github.com/apache/spark/pull/10924#discussion_r62077076
--- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala ---
@@ -326,11 +340,25 @@ private[spark] class CoarseMesosSchedulerBackend(
d.launchTasks(
Collections.singleton(offer.getId),
offerTasks.asJava)
- } else { // decline
- logDebug(s"Declining offer: $id with attributes: $offerAttributes " +
- s"mem: $offerMem cpu: $offerCpus")
-
- d.declineOffer(offer.getId)
+ } else if (totalCoresAcquired >= maxCores) {
+ // We already acquired the maximum number of cores so we don't need to get new offers
+ // unless an executor goes down. Setting a high "refuse seconds" filter is especially
+ // important when running a lot of frameworks in the same Mesos cluster to avoid resource
+ // starvation. One such case of starvation happens when running many small Spark apps
+ // (e.g. small Spark streaming jobs) then a new big Spark app would get offered only a
+ // fraction of the cores available in the cluster and Mesos would then stop sending it
+ // offers. That's because the small apps have a much smaller "max share" so they get the
+ // offers first. With a low number of apps it's okay because with the default
+ // refuse_seconds value of 5 seconds it's enough time for Mesos to cycle through every
+ // app and send offers to each of them. But as the number of apps increases it becomes
+ // more and more problematic, to the point where Mesos stops sending offers to the apps
+ // ranked the lowest by DRF, i.e. the big apps. We mitigate this problem by declining
+ // the offers for a long period of time when we know that we don't need offers anymore
+ // because the app already acquired all the cores it needs.
--- End diff --
This is a bit verbose. I think something like "Reject an offer for a
configurable amount of time to avoid starving other frameworks" is sufficient.
Also, thanks for the code docs, but I was thinking we should add this
config var to the user docs as well.
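
For anyone reading along, here is a minimal sketch of the decision the diff above seems to implement, written without any Mesos or Spark dependency so it can be run on its own. The names `OfferDeclineSketch`, `OfferDecision`, `decide` and `rejectSeconds` are hypothetical and not part of the PR, and the 120-second value used in the usage note is only an illustrative assumption. In the real backend the long decline would presumably be sent by passing a Mesos `Filters` with `refuse_seconds` set to the driver's `declineOffer` call.

```scala
// Hypothetical, dependency-free model of the branch structure in the diff.
// None of these names come from Spark; they exist only for illustration.
object OfferDeclineSketch {
  sealed trait OfferDecision
  case object Launch extends OfferDecision
  // refuseSeconds = None stands for "decline with the Mesos default filter".
  final case class Decline(refuseSeconds: Option[Double]) extends OfferDecision

  def decide(
      offerUsable: Boolean,     // true when tasks could be built from the offer
      totalCoresAcquired: Int,
      maxCores: Int,
      rejectSeconds: Double     // assumed to come from a user-facing setting
  ): OfferDecision = {
    if (offerUsable) {
      Launch
    } else if (totalCoresAcquired >= maxCores) {
      // The app already has all its cores: refuse for a long, configurable
      // period so other frameworks are not starved of offers.
      Decline(Some(rejectSeconds))
    } else {
      // Ordinary decline; the offer simply did not fit.
      Decline(None)
    }
  }
}
```

With these assumptions, `OfferDeclineSketch.decide(false, 8, 8, 120.0)` yields `Decline(Some(120.0))`, while the same call with `totalCoresAcquired = 4` yields `Decline(None)`.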