Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/1486#issuecomment-54776543
Hey @cmccabe - thanks for looking at this. From what I can tell, the current
approach has the drawback that if there are both cached and non-cached
locations, the non-cached locations will simply be ignored. This could actually
regress performance for workloads where having, e.g., 3 machine-local replicas
is better than having 1 cached replica.
To get proper delay scheduling with this, where we fall back from cached
copies to non-cached copies, I think it would be best to just use the existing
mechanism we have for preferring replicas that are cached in Spark's own
process on a specific node. This corresponds to the PROCESS_LOCAL locality level.
To get this I think you can make a relatively surgical change, which is that
`HadoopRDD` should return a `TaskLocation` with the `executorId` populated if
there is an in-memory replica available on that node. Then we can change the
documentation a bit to explain that this field is somewhat overloaded at the
moment and means two different things. You would need to add a lookup to see
if we presently have an executor on that node and, if so, find its ID.
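A minimal sketch of what I have in mind, in Scala. The names here (`cachedHosts`, `executorIdOnHost`, and the simplified `TaskLocation` case class) are illustrative assumptions, not Spark's actual API; the point is that cached replicas get an `executorId` (eligible for PROCESS_LOCAL) while non-cached replicas are still returned as plain host locations (NODE_LOCAL), so delay scheduling can fall back to them instead of ignoring them:

```scala
// Simplified stand-in for Spark's TaskLocation: executorId is populated only
// when we want PROCESS_LOCAL preference on that host. (Hypothetical shape.)
case class TaskLocation(host: String, executorId: Option[String] = None)

object CachedLocationSketch {
  // Map a block's replica hosts to TaskLocations. If HDFS reports the replica
  // as cached in memory on a host AND we currently have an executor there,
  // attach that executor's ID; otherwise return a host-only location so the
  // non-cached replicas are not dropped.
  def preferredLocations(
      replicaHosts: Seq[String],
      cachedHosts: Set[String],
      executorIdOnHost: String => Option[String]): Seq[TaskLocation] = {
    replicaHosts.map { host =>
      if (cachedHosts.contains(host)) {
        TaskLocation(host, executorIdOnHost(host))
      } else {
        TaskLocation(host)
      }
    }
  }
}
```

With three replicas of which one is cached, this yields all three locations rather than only the cached one, so the scheduler can still take a machine-local non-cached replica when the cached node's executor is busy.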