Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18651#discussion_r128831747
--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala ---
@@ -525,9 +534,11 @@ private[yarn] class YarnAllocator(
           } catch {
             case NonFatal(e) =>
               logError(s"Failed to launch executor $executorId on container $containerId", e)
-              // Assigned container should be released immediately to avoid unnecessary resource
-              // occupation.
+              // Assigned container should be released immediately
+              // to avoid unnecessary resource occupation.
               amClient.releaseAssignedContainer(containerId)
+          } finally {
+            numExecutorsStarting.decrementAndGet()
--- End diff ---
yes but it's a bug right now as the numbers can be wrong. Are you looking at
the synchronization?
Right now everything is called synchronized up to the point of handing the
ExecutorRunnable to the launcher pool. At that point running is not
incremented, pending is decremented, and we now increment starting. That is
fine.
But when the ExecutorRunnable finishes, the only place it's called
synchronized is in updateInternalState, which right now increments running
but does not decrement starting. If updateResourceRequests (which is
synchronized) gets called right after updateInternalState leaves its
synchronized block, but before the finally block executes and decrements
starting, the total can be higher than it really is: that executor is
counted as both running and starting.
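
For what it's worth, here is a minimal, self-contained Scala sketch of the
window described above. The counter and method names (numExecutorsStarting,
numExecutorsRunning, updateInternalState, totalKnownExecutors) mirror this
discussion, but this is an illustration, not the actual YarnAllocator code;
the sleeps just widen the race window so it shows up reliably.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Sketch of the double-counting window: running is incremented under the
// lock, but starting is decremented later, outside it.
object CounterRaceSketch {
  private val numExecutorsRunning = new AtomicInteger(0)
  private val numExecutorsStarting = new AtomicInteger(0)

  // Analogue of updateInternalState: runs on the launcher pool thread
  // once the ExecutorRunnable finishes.
  private def updateInternalState(): Unit = synchronized {
    numExecutorsRunning.incrementAndGet()
    // Note: numExecutorsStarting is NOT decremented here; that happens
    // later, in a finally block outside this synchronized section.
  }

  // Analogue of updateResourceRequests: reads the counters under the
  // same lock to decide how many executors it believes exist.
  private def totalKnownExecutors(): Int = synchronized {
    numExecutorsRunning.get() + numExecutorsStarting.get()
  }

  def main(args: Array[String]): Unit = {
    numExecutorsStarting.incrementAndGet() // container assigned, launch begins

    val launcher = new Thread(() => {
      try {
        updateInternalState() // running = 1, starting still = 1
        Thread.sleep(50)      // window where the executor is counted twice
      } finally {
        numExecutorsStarting.decrementAndGet()
      }
    })
    launcher.start()
    Thread.sleep(10)

    // If this runs inside the window, the total reads 2 for a single
    // executor: once as running and once as starting.
    println(s"total seen by allocator: ${totalKnownExecutors()}")
    launcher.join()
  }
}
```

Moving the decrement of numExecutorsStarting into updateInternalState's
synchronized block would close this window, since running and starting would
then change atomically with respect to updateResourceRequests.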