[jira] [Commented] (SPARK-4290) Provide an equivalent functionality of distributed cache as MR does

Sandy Ryza (JIRA) Thu, 06 Nov 2014 20:17:00 -0800

    [ https://issues.apache.org/jira/browse/SPARK-4290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201572#comment-14201572 ]


Sandy Ryza commented on SPARK-4290:
-----------------------------------

If you call SparkContext#addFile, each executor pulls the file onto its local 
disk only once.
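
As a rough illustration of the addFile route, here is a minimal PySpark sketch. 
The helper names (load_lookup, run_job), the tab-separated file format, and the 
app name are assumptions made up for this example; only SparkContext.addFile 
and SparkFiles.get are actual Spark calls. It assumes the functions live in the 
driver's main script so that Spark can ship them to the executors.

```python
import os


def load_lookup(path):
    """Parse 'key<TAB>value' lines from a local file into a dict (hypothetical format)."""
    table = {}
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            key, value = line.split("\t", 1)
            table[key] = value
    return table


def run_job(lookup_path, keys):
    """Ship lookup_path with addFile and read it from the local disk inside tasks."""
    # Imported here so the parsing helper above stays usable without Spark installed.
    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="addfile-sketch")  # app name is arbitrary
    try:
        # Registers the file so that nodes running this job download it.
        sc.addFile(lookup_path)
        name = os.path.basename(lookup_path)

        def annotate(partition):
            # SparkFiles.get returns the absolute path of the local copy.
            table = load_lookup(SparkFiles.get(name))
            for k in partition:
                yield (k, table.get(k))

        return sc.parallelize(keys).mapPartitions(annotate).collect()
    finally:
        sc.stop()
```

Calling run_job("/path/to/lookup.tsv", ["a", "b"]) from the driver would return 
(key, value) pairs looked up from the local copy. Reading the file once per 
partition rather than once per record keeps the parsing cost low.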

It's true that for large clusters, writing to HDFS with high replication could 
be more efficient than sending from the driver to every executor.  It might be 
worth implementing / using a more sophisticated broadcast mechanism rather than 
adding a new API.


> Provide an equivalent functionality of distributed cache as MR does
> -------------------------------------------------------------------
>
>                 Key: SPARK-4290
>                 URL: https://issues.apache.org/jira/browse/SPARK-4290
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Xuefu Zhang
>
> MapReduce allows a client to specify files to be put in the distributed cache 
> for a job, and the framework guarantees that each file will be available in 
> the local file system of a node where a task of the job runs, before the 
> task actually starts. While this might be achieved with YARN via hacks, it's 
> not available in other clusters. It would be nice to have equivalent 
> functionality in Spark.
> It would also complement Spark's broadcast variable, which may not be 
> suitable in certain scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
