[
https://issues.apache.org/jira/browse/SPARK-23423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16369221#comment-16369221
]
Igor Berman edited comment on SPARK-23423 at 2/20/18 7:33 AM:
--------------------------------------------------------------
[~skonto] so I've run the application today with the relevant logs at debug
level (previously I just had problems with the loggers being reconfigured
dynamically, so I hadn't seen the TASK_KILLED reports). It seems that with
dynamic allocation on, with executors constantly starting and shutting down,
the chances that every slave accumulates 2 failures while starting some
executor are much higher than in the regular case (without dynamic
allocation). SPARK-19755 looks like the core issue here: after half a day of
a long-running driver in client mode, almost 1/3 of all Mesos slaves can end
up blacklisted.
The reasons for the executor failures can vary and be transient (e.g. a port
collision).
I think I'll close this Jira as a duplicate of SPARK-19755, WDYT?
Here is just one example: out of 74 Mesos slaves, 16 are already blacklisted:
{code}
grep "Blacklisting Mesos slave" /var/log/mycomp/spark-myapp.log | wc -l
16
{code}
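
For reference, this is roughly the shape of the blacklisting behaviour I'm
describing; a simplified sketch only, not the actual
MesosCoarseGrainedSchedulerBackend code, and the names Slave,
MAX_SLAVE_FAILURES and handleFailedExecutor below are illustrative
assumptions:
{code:java}
import scala.collection.mutable

// Sketch: count executor failures per slave and stop accepting offers from a
// slave once it has failed a couple of times. With dynamic allocation
// constantly launching and killing executors, transient failures (e.g. port
// collisions) push many slaves over this threshold surprisingly quickly.
object BlacklistSketch {
  final val MAX_SLAVE_FAILURES = 2

  case class Slave(hostname: String, var taskFailures: Int = 0)

  private val slaves = mutable.Map.empty[String, Slave]

  // Called when an executor task on a slave terminates with a failure.
  def handleFailedExecutor(slaveId: String, hostname: String): Unit = {
    val slave = slaves.getOrElseUpdate(slaveId, Slave(hostname))
    slave.taskFailures += 1
    if (slave.taskFailures >= MAX_SLAVE_FAILURES) {
      println(s"Blacklisting Mesos slave $slaveId due to too many failures")
    }
  }

  // Offers from a blacklisted slave are simply skipped.
  def isBlacklisted(slaveId: String): Boolean =
    slaves.get(slaveId).exists(_.taskFailures >= MAX_SLAVE_FAILURES)
}
{code}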
> Application declines any offers when killed+active executors reach
> spark.dynamicAllocation.maxExecutors
> ------------------------------------------------------------------------------------------------------
>
> Key: SPARK-23423
> URL: https://issues.apache.org/jira/browse/SPARK-23423
> Project: Spark
> Issue Type: Bug
> Components: Mesos, Spark Core
> Affects Versions: 2.2.1
> Reporter: Igor Berman
> Priority: Major
> Labels: Mesos, dynamic_allocation
>
> Hi,
> Mesos version: 1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend
> when running on Mesos with dynamic allocation on and the maximum number of
> executors limited by spark.dynamicAllocation.maxExecutors.
> Suppose we have a long-running driver with a cyclic pattern of resource
> consumption (with some idle time in between); due to dynamic allocation it
> receives offers and then releases the executors once the current chunk of
> work is processed.
> At
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L573]
> the backend checks numExecutors < executorLimit, where numExecutors is
> defined as slaves.values.map(_.taskIDs.size).sum and slaves holds all slaves
> ever "met", i.e. both active and killed (see the comment at
> [https://github.com/apache/spark/blob/master/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosCoarseGrainedSchedulerBackend.scala#L122]).
>
> The number of taskIDs should be decremented via statusUpdate, but suppose
> that update is lost (indeed, I don't see any 'is now TASK_KILLED' log
> entries); then the executor count stays wrong and the backend keeps
> declining offers.
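>
> To make the accounting concrete, here is a minimal sketch (not the real
> backend code; the Slave model and names below are illustrative only) of why
> a lost TASK_KILLED update keeps numExecutors at the limit:
> {code:java}
> import scala.collection.mutable
>
> // Simplified model: slaves keeps every slave ever seen, and a taskID is
> // only removed when its terminal status update actually arrives.
> object NumExecutorsSketch extends App {
>   case class Slave(taskIDs: mutable.Set[String])
>
>   val slaves = mutable.Map("s1" -> Slave(mutable.Set("0")))
>   val executorLimit = 1
>
>   def numExecutors: Int = slaves.values.map(_.taskIDs.size).sum
>
>   // Executor "0" was killed, but the TASK_KILLED update never arrived,
>   // so "0" stays in taskIDs and the limit remains "reached".
>   println(numExecutors < executorLimit) // false -> every new offer declined
> }
> {code}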
>
> I've created a test that "reproduces" this behavior; not sure how good it is:
> {code:java}
> // MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation enabled") {
>   setBackend(Map(
>     "spark.dynamicAllocation.maxExecutors" -> "1",
>     "spark.dynamicAllocation.enabled" -> "true",
>     "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>
>   // The first offer launches the single allowed executor.
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>
>   // Kill that executor; its task still counts against the limit.
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>
>   // The next offer is declined even though no executor is running.
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }
> {code}
>
>
> Workaround: don't set spark.dynamicAllocation.maxExecutors when dynamic allocation is on.
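>
> A minimal sketch of the workaround configuration (the app name and the extra
> shuffle-service property are just illustrative; the point is only to leave
> spark.dynamicAllocation.maxExecutors unset):
> {code:java}
> import org.apache.spark.SparkConf
>
> // Dynamic allocation stays on, but with no explicit maxExecutors the
> // (possibly inflated) executor count never hits an artificial ceiling.
> val conf = new SparkConf()
>   .setAppName("myapp")
>   .set("spark.dynamicAllocation.enabled", "true")
>   .set("spark.shuffle.service.enabled", "true")
> // intentionally NOT setting spark.dynamicAllocation.maxExecutors
> {code}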
>
> Please advise,
> Igor
> Tagging you, friends, since you were the last to touch this piece of code
> and can probably advise something ([~vanzin], [~skonto], [~susanxhuynh])