Github user mridulm commented on the pull request:
https://github.com/apache/spark/pull/1486#issuecomment-56506066
@pwendell This is not Hadoop RDD-specific functionality - it is a general
requirement which can be leveraged by any RDD in Spark - and Hadoop RDD
currently happens to have a use case for it when DFS caching is enabled.
The fact that a preferred location is currently a String may be the
limitation here: extending it to a URI or anything else will add
overhead (as the current patch does).
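To make the String limitation concrete, here is a minimal sketch (not the actual Spark implementation): `RDD.getPreferredLocations` returns plain host strings, so any richer locality information - an executor id for process-local placement, a URI scheme, etc. - would either need a structured type or an ad-hoc encoding packed into the string. The `TaskLocation` hierarchy and the `host_executorId` encoding below are hypothetical illustrations, not existing Spark API:

```scala
// Hypothetical structured locality type; today's API is just Seq[String].
sealed trait TaskLocation
case class HostLocation(host: String) extends TaskLocation
case class ProcessLocalLocation(host: String, executorId: String) extends TaskLocation

object LocalitySketch {
  // Illustrative parser: without a structured type, anything beyond a bare
  // hostname has to be smuggled into the string (here, "host_executorId") -
  // exactly the kind of surgical hack the comment warns about.
  def parse(raw: String): TaskLocation =
    raw.split("_") match {
      case Array(host, executorId) => ProcessLocalLocation(host, executorId)
      case _                       => HostLocation(raw)
    }

  def main(args: Array[String]): Unit = {
    println(parse("host1"))
    println(parse("host2_exec7"))
  }
}
```

A structured type lets the scheduler pattern-match on locality kinds instead of re-parsing strings at every call site, which is the "proper design change" alluded to above.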
For example: an RDD that pulls data from Tachyon or another distributed
memory store, loading data onto accelerator cards, and specifying
process-local locality for a block are all uses of the same functionality, in my opinion.
If this is not addressed properly, then when the next similar requirement
comes along we will either be rewriting this code or adding more surgical
hacks along the same lines.
If the expectation is that Spark won't need to support these other
requirements [1], then we can definitely punt on a proper design change.
Given this is not a user-facing change (right?), we can take the
current approach and replace it later, or do a more principled solution up front.
@kayousterhout @markhamstra @mateiz any thoughts, given this modifies
TaskSetManager for the addition of this feature?
[1] Which is unlikely given MLlib's rapid pace of development - it is
fairly inevitable that we will need to support accelerator cards sooner
rather than later, at least given the arc of our past efforts with ML on Spark.