Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/634#issuecomment-42172635

Scheduling and data locality come later; this precedes all of that. To give an example: suppose we need 200 containers to run a job. As soon as we start, we might get only, say, 2 or 5 nodes allocated; it takes some time for YARN to ramp up to a reasonable number. If we start the actual job before sufficient containers have been allocated, we can get job failures because a) data can't be cached in memory, and b) there is too much load on too few containers.

Task locality and scheduling come later, but they are also impacted. Assuming the job does not fail, all the data now lives on this small number of containers while the rest of the cluster is busy pulling data from them, which causes suboptimal performance (I have not seen job failures due to this -- yet).
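The behavior described above can be sketched as a simple readiness gate: before submitting the first job, poll the number of registered executors and block until either a minimum fraction of the requested containers has come up or a timeout expires. This is only an illustrative sketch, not the PR's actual implementation; the class name, method names, and parameters (`minRatio`, `maxWaitMs`, `pollMs`) are all hypothetical.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch: gate job submission on YARN container ramp-up.
// None of these names come from Spark's codebase.
public class WaitForExecutors {

    /**
     * Blocks until at least {@code requested * minRatio} executors have
     * registered, or {@code maxWaitMs} elapses.
     *
     * @param registered supplier returning the current registered-executor count
     * @param requested  total number of containers requested from YARN
     * @return true if the minimum ratio was reached before the timeout
     */
    public static boolean waitUntilReady(IntSupplier registered,
                                         int requested,
                                         double minRatio,
                                         long maxWaitMs,
                                         long pollMs) throws InterruptedException {
        int needed = (int) Math.ceil(requested * minRatio);
        long deadline = System.currentTimeMillis() + maxWaitMs;
        while (System.currentTimeMillis() < deadline) {
            if (registered.getAsInt() >= needed) {
                return true;  // enough containers allocated; safe to start the job
            }
            Thread.sleep(pollMs);  // give YARN time to allocate more containers
        }
        return false;  // timed out: caller decides whether to start anyway or fail
    }
}
```

For reference, Spark later exposed this idea as configuration (e.g. `spark.scheduler.minRegisteredResourcesRatio` and a companion max-wait setting), though the exact names and semantics at the time of this PR may differ.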
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---