Github user mridulm commented on the pull request:
https://github.com/apache/spark/pull/634#issuecomment-42172635
Scheduling and data locality come later; this precedes all of that.
To give an example: suppose we need 200 containers to run a job. As soon as
we start, we might get only, say, 2 or 5 containers allocated; it takes some
time for YARN to ramp up to a reasonable number.
Now, if we start the actual job before sufficient containers have been
allocated, we can get job failures, since a) data can't be cached in memory,
and b) there is too much load on too few containers.
Task locality and scheduling come later, but they are also impacted.
Assuming the job does not fail, all the data ends up on this small number of
containers while the rest of the cluster is busy pulling data from them, which
causes suboptimal performance (I have not seen job failures due to this yet).
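
To make the failure mode concrete, here is a minimal Scala sketch of the kind of gate being discussed: hold back job submission until some fraction of the requested containers have registered, with a timeout so a slow cluster does not block the job forever. All names, defaults, and thresholds below are illustrative, not the PR's actual code (for reference, Spark's `spark.scheduler.minRegisteredResourcesRatio` and `spark.scheduler.maxRegisteredResourcesWaitingTime` settings control a similar trade-off):

```scala
// Illustrative sketch only: gate job start on YARN container ramp-up.
object ExecutorGate {
  /**
   * Blocks until at least `minRatio` of the `requested` containers have
   * registered, or `maxWaitMs` elapses. Returns true if the cluster
   * ramped up in time. All parameters and defaults are hypothetical.
   */
  def awaitSufficientExecutors(
      requested: Int,               // containers asked from YARN, e.g. 200
      registeredCount: () => Int,   // how many have registered so far
      minRatio: Double = 0.8,       // fraction to wait for before starting
      maxWaitMs: Long = 30000L,     // give up waiting after this long
      pollIntervalMs: Long = 100L): Boolean = {
    val needed = math.ceil(requested * minRatio).toInt
    val deadline = System.currentTimeMillis() + maxWaitMs
    while (registeredCount() < needed && System.currentTimeMillis() < deadline) {
      Thread.sleep(pollIntervalMs)
    }
    registeredCount() >= needed
  }
}
```

The timeout matters: without it, a cluster that never reaches the ratio would hang the job, so the gate degrades to "start anyway after waiting a bounded time" rather than a hard requirement.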