[ 
https://issues.apache.org/jira/browse/SPARK-4290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201561#comment-14201561
 ] 

Xuefu Zhang commented on SPARK-4290:
------------------------------------

Hi [~rxin], from the documentation of the above Java class, I read:
{quote}
Its efficiency stems from the fact that the files are only copied once per job 
and the ability to cache archives which are un-archived on the slaves.
{quote}

Two things are suggested (a minimal usage sketch follows the list):
1. One copy per job. Even if multiple tasks of a job run on the same node, there is 
still only one copy from HDFS to the local file system.
2. Archives are un-archived on the slaves.
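
For reference, here is a minimal sketch (Scala, with made-up HDFS paths) of how a MapReduce client typically registers cached files and archives; the task side then finds them already localized, and archives un-packed, under the fragment names:
{code}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job

// Paths below are hypothetical, for illustration only.
val job = Job.getInstance(new Configuration(), "distributed-cache-example")

// The archive is copied from HDFS to each node once per job; the framework
// un-archives it on the slaves and links it under the fragment name ("dict").
job.addCacheArchive(new URI("hdfs:///shared/dict.zip#dict"))

// Plain files are likewise localized once per job.
job.addCacheFile(new URI("hdfs:///shared/stopwords.txt#stopwords.txt"))
{code}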

Regarding your #2, it would be desirable to achieve one copy per job (not one copy 
per request). That is, subsequent requests for the file from other tasks of the 
same job would read directly from the local copy.

Of course, this is only a surface-level view; the actual implementation can be much 
more complicated.
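
For comparison, Spark's existing SparkContext.addFile / SparkFiles.get already covers part of this for plain files: each executor fetches the file once and every task on that executor reads the local copy. What it does not cover is un-archiving archives or the full per-job guarantee discussed above. A minimal sketch, with a made-up path and a local master purely for illustration:
{code}
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

val sc = new SparkContext(
  new SparkConf().setAppName("add-file-example").setMaster("local[2]"))

// Hypothetical HDFS path, for illustration only.
sc.addFile("hdfs:///shared/stopwords.txt")

sc.parallelize(1 to 4).foreach { _ =>
  // Each executor fetches the file once into its work directory; every task
  // running on that executor then reads the already-localized local copy.
  val localPath = SparkFiles.get("stopwords.txt")
  println(localPath)
}
{code}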

> Provide an equivalent functionality of distributed cache as MR does
> -------------------------------------------------------------------
>
>                 Key: SPARK-4290
>                 URL: https://issues.apache.org/jira/browse/SPARK-4290
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Xuefu Zhang
>
> MapReduce allows a client to specify files to be put in the distributed cache for a 
> job, and the framework guarantees that each file will be available on the local 
> file system of a node where a task of the job runs, before the task actually 
> starts. While this might be achieved on YARN via hacks, it is not available on 
> other cluster managers. It would be nice to have equivalent functionality in 
> Spark.
> It would also complement Spark's broadcast variable, which may not be 
> suitable in certain scenarios.
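
As a contrast to the quoted description: a broadcast variable ships a deserialized object to executor memory rather than materializing a file on the node's local file system, which is one scenario where it may not be suitable (e.g. native libraries or archives that must exist as files). A minimal sketch, with made-up paths and values:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("broadcast-example").setMaster("local[2]"))

// The lookup set lives as a deserialized object in executor memory; it is
// never written out as a file on the node's local file system.
val stopwords = sc.broadcast(Set("a", "an", "the"))

val counts = sc.textFile("hdfs:///shared/corpus.txt")   // hypothetical path
  .flatMap(_.split("\\s+"))
  .filter(word => !stopwords.value.contains(word))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)
{code}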



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
