GitHub user mindprince commented on the pull request:

    https://github.com/apache/spark/pull/5233#issuecomment-87160289
  
    Hi @sryza
    
    We faced an issue where not doing this caused incorrect accounting of the queue's AM resources.
    
    As you may be aware, the YARN fair scheduler has a queue property called `maxAMShare`, which limits the fraction of the queue's fair share that can be used to run application masters.
    
    There is a variable called `amResourceUsage` which tracks the resources used by AMs in a queue.
    
    [This is the 
block](https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSAppAttempt.java#L525-L529)
 in `FSAppAttempt.java` which increases `amResourceUsage`:
    ```
          if (getLiveContainers().size() == 1 && !getUnmanagedAM()) {
            getQueue().addAMResourceUsage(container.getResource());
            setAmRunning(true);
          }
    ```
    
    [The following 
block](https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FSLeafQueue.java#L91-L96)
 in `FSLeafQueue.java` decreases `amResourceUsage`:
    ```
    public boolean removeApp(FSAppAttempt app) {
      if (runnableApps.remove(app)) {
        // Update AM resource usage
        if (app.isAmRunning() && app.getAMResource() != null) {
          Resources.subtractFrom(amResourceUsage, app.getAMResource());
        }
        // ...
    ```
    
    So one issue (in my opinion) on YARN's side is that it uses `getLiveContainers().size()` to decide whether a newly allocated container is the AM (the small replay after the trace below shows how this misfires).
    
    Now to the main issue: the Spark AM goes down without releasing its containers. This is what I observed while stepping through the code:
    ```
    Spark AM starts: request 1024MB
    That is allocated.
    numLiveContainers for this app = 1.
    amResourceUsage for queue = 1024MB
    
    Spark executor starts: request1 = 6144MB
    This is allocated.
    numLiveContainers for this app = 2.
    amResourceUsage is not changed.
    
    Spark requests another container: request2 = 6144MB
    Assume for some reason this is not allocated. (Insufficient resources in 
the cluster.)
    
    Now, I see that a container with memory = 1024MB is completed. (This is the 
AM container.)
    numLiveContainers = 1 
    
    A container with memory = 6144MB is completed.
    numLiveContainers = 0
    
    The request2 for 6144MB, which was unsuccessful earlier, is now successful.
    numLiveContainers = 1
    amResourceUsage = 1024 + 6144 = 7168 (This shouldn't have happened, but it does because numLiveContainers == 1.)
    
    Now the app is complete and removeApp is called.
    It sets amResourceUsage to 7168 - 1024 = 6144, because the Spark AM's resource request was 1024MB.
    ```
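    
    For what it's worth, the same sequence can be replayed with a tiny standalone model of the accounting (my own sketch, not the scheduler code); it ends with the queue believing 6144MB of AM resources are still in use:
    ```
    // Toy replay of the trace above; the "first live container must be the
    // AM" heuristic is modelled directly, everything else is simplified.
    public class AmAccountingLeak {
      static int liveContainers = 0;
      static long amResourceUsageMb = 0;
      static final long AM_MB = 1024, EXECUTOR_MB = 6144;

      static void allocate(long mb) {
        liveContainers++;
        if (liveContainers == 1) {
          // Whatever is allocated while it is the only live container gets
          // charged as the AM.
          amResourceUsageMb += mb;
        }
      }

      static void complete() {
        liveContainers--;
      }

      public static void main(String[] args) {
        allocate(AM_MB);        // AM container    -> amResourceUsage = 1024
        allocate(EXECUTOR_MB);  // first executor  -> unchanged
        complete();             // AM container exits, liveContainers = 1
        complete();             // executor exits,     liveContainers = 0
        allocate(EXECUTOR_MB);  // request2 finally lands as the only live
                                // container       -> amResourceUsage = 7168
        amResourceUsageMb -= AM_MB; // removeApp subtracts only the real AM size
        System.out.println(amResourceUsageMb); // 6144, leaked for good
      }
    }
    ```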
    After a couple of Spark applications have been launched, the cluster becomes deadlocked: no more AMs are admitted to the queue because `amResourceUsage` is incorrectly high.
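    
    To put rough numbers on it: assuming, say, a queue fair share of 20480MB and maxAMShare = 0.5 (an AM budget of 10240MB), each finished Spark app leaves a phantom 6144MB in `amResourceUsage`, so after two such apps the queue already believes 12288MB of AM resources are in use and will not admit even a 1024MB AM.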

