Github user mridulm commented on the pull request:
https://github.com/apache/spark/pull/1486#issuecomment-56506066
@pwendell This is not Hadoop RDD-specific functionality - it is a general
requirement which can be leveraged by any RDD in Spark - and Hadoop RDD
currently happens to have a use case for it when DFS caching is enabled.
The fact that a preferred location is currently a String may be the
limitation here: extending it to a URI or anything else will add
overhead (as the current patch does).
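To make the String limitation concrete, here is a minimal sketch (not the actual Spark implementation): `RDD.getPreferredLocations` returns plain host strings, so any richer locality information - an executor id for process-local placement, a URI scheme, etc. - would either need a structured type or an ad-hoc encoding packed into the string. The `TaskLocation` hierarchy and the `host_executorId` encoding below are hypothetical illustrations, not existing Spark API:

```scala
// Hypothetical structured locality type; today's API is just Seq[String].
sealed trait TaskLocation
case class HostLocation(host: String) extends TaskLocation
case class ProcessLocalLocation(host: String, executorId: String) extends TaskLocation

object LocalitySketch {
  // Illustrative parser: without a structured type, anything beyond a bare
  // hostname has to be smuggled into the string (here, "host_executorId") -
  // exactly the kind of surgical hack the comment warns about.
  def parse(raw: String): TaskLocation =
    raw.split("_") match {
      case Array(host, executorId) => ProcessLocalLocation(host, executorId)
      case _                       => HostLocation(raw)
    }

  def main(args: Array[String]): Unit = {
    println(parse("host1"))
    println(parse("host2_exec7"))
  }
}
```

A structured type lets the scheduler pattern-match on locality kinds instead of re-parsing strings at every call site, which is the "proper design change" alluded to above.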
For example: an RDD that pulls data from Tachyon or another distributed
memory store, loading data onto accelerator cards, and specifying
process-local locality for a block are all uses of the same functionality, in my opinion.
If this is not addressed properly, then when the next similar requirement
comes along we will either be rewriting this code or adding more surgical
hacks along the same lines.
If the expectation is that Spark won't need to support these other
requirements [1], then we can definitely punt on a proper design change.
Given this is not a user-facing change (right?), we can take the
current approach and replace it later, or do a more principled solution up front.
@kayousterhout @markhamstra @mateiz any thoughts, given this modifies
TaskSetManager for the addition of this feature?
[1] Which is unlikely given MLlib's rapid pace of development - it is
fairly inevitable that we will need to support accelerator cards sooner
rather than later, at least given the arc of our past efforts with ML on Spark.