Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17480#discussion_r111436461
  
    --- Diff: 
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
    @@ -249,7 +249,14 @@ private[spark] class ExecutorAllocationManager(
        * yarn-client mode when AM re-registers after a failure.
        */
       def reset(): Unit = synchronized {
    -    initializing = true
    +    /**
    +     * When some tasks need to be scheduled and initial executor = 0, 
resetting the initializing
    +     * field may cause it to not be set to false in yarn.
    +     * SPARK-20079: https://issues.apache.org/jira/browse/SPARK-20079
    +     */
    +    if (maxNumExecutorsNeeded() == 0) {
    +      initializing = true
    --- End diff --
    
    You're explaining what the code does as a justification for why a hacky fix 
should be applied to this issue. I'm asking why the code needs to behave like 
that. If there's no actual *need* for the code to behave like that, it should 
be fixed.
    
    Basically, imagine that at t1 the AM dies, and at t2 a new AM comes up and 
registers. What should happen from the driver's point of view? (Note, what 
*should happen*, not what the code does.)
    
    In my view, the answer is "nothing". The driver knows what it needs, so the 
new AM should start as closely as possible to the state of the previous AM. 
Doing that might be hard (e.g. caching the complete list of known containers 
somewhere, probably the driver), so some things are sub-optimal (containers 
will be re-started). But as far as numbers go, the new AM should basically 
start up with the same number of containers the previous AM was managing 
(ignoring the time needed to start them up).
    
    If the AM doesn't do that currently, then why is that? It asks the driver 
for state related to the previous AM already (see 
`RetrieveLastAllocatedExecutorId` call). Why can't that call return more state 
needed for the new AM to sync up to what the driver needs?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to