Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/634#issuecomment-42172635
  
    The scheduling and data locality come later.
    This precedes all that.
    
    To give an example: suppose we need 200 containers to run a job. As soon as 
we start, we might get only 2 or 5 nodes allocated; it takes some time for YARN 
to ramp up to a reasonable number.
    
    Now, if we start the actual job before sufficient containers have been 
allocated, we get job failures because a) data can't be cached in memory and b) 
there is too much load on too few containers.
    
    Task locality and scheduling come later, but they are also impacted.
    Assuming the job does not fail, all the data ends up on this small number 
of containers while the rest of the cluster is busy pulling data from them, 
which causes suboptimal performance (I have not seen job failures due to this - yet).
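
    To illustrate the idea of gating job start on executor ramp-up, here is a 
hedged configuration sketch using Spark's resource-registration settings. The 
property names and values below are assumptions for illustration of the general 
mechanism, not necessarily what this PR implements:

    ```properties
    # spark-defaults.conf -- illustrative values only
    # Wait until at least 80% of the requested executors have registered
    # before scheduling the first tasks...
    spark.scheduler.minRegisteredResourcesRatio       0.8
    # ...but give up waiting after this timeout and start anyway, so a
    # slow YARN ramp-up cannot block the job forever.
    spark.scheduler.maxRegisteredResourcesWaitingTime 3m
    ```

    With a gate like this, the 200-container example above would hold back task 
scheduling until roughly 160 containers are up, instead of piling all cached 
data onto the first 2-5 nodes.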

