[ https://issues.apache.org/jira/browse/SPARK-4290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201496#comment-14201496 ]
Xuefu Zhang commented on SPARK-4290:
------------------------------------
Hi [~rxin], by "out of box", do you mean
org.apache.hadoop.filecache.DistributedCache [1]? This is a MapReduce client
class, used when you submit an MR job. It basically tells the MR framework
that your job needs these files placed in the distributed cache in order to
run. The MR framework then copies the files to the local file system of each
node where a task of the job runs, and the tasks access the local copies via
symlinks.
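To make the mechanism concrete, here is a minimal sketch of the client-side
call, assuming a plain MR2 job; the file path and symlink name are made up
for illustration. DistributedCache [1] is the older client-side entry point;
in the current API the same settings go through Job#addCacheFile.
{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cache-file-example");

    // Ask the framework to localize this HDFS file on every node that runs
    // a task of this job; the "#lookup" fragment requests a symlink named
    // "lookup" in the task's working directory.
    job.addCacheFile(new URI("hdfs:///tmp/lookup.dat#lookup"));

    // ... set mapper/reducer, input/output paths, and submit as usual ...
    // Inside a task, the file can then be opened as a plain local file,
    // e.g. new java.io.File("lookup").
  }
}
{code}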
I don't know how this can be used out of the box. First, a Hive on Spark user
may not have the MR client library. Secondly, there is no MR framework in
place to do the copying.
Do you have an example of how I might achieve this?
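For what it's worth, the closest thing I can see on the Spark side is
SparkContext#addFile() on the driver plus SparkFiles.get() in the task, but
I'm not sure whether that is what you had in mind. A rough sketch, with a
made-up file name:
{code:java}
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public class AddFileExample {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("add-file-example"));

    // Driver side: register the file; Spark ships it to every executor node.
    sc.addFile("hdfs:///tmp/lookup.dat");

    sc.parallelize(Arrays.asList(1, 2, 3)).foreach(x -> {
      // Executor side: resolve the node-local copy and read it like any
      // ordinary local file.
      String localPath = SparkFiles.get("lookup.dat");
      System.out.println(localPath);
    });

    sc.stop();
  }
}
{code}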
[1]
http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/filecache/DistributedCache.html
> Provide an equivalent functionality of distributed cache as MR does
> -------------------------------------------------------------------
>
> Key: SPARK-4290
> URL: https://issues.apache.org/jira/browse/SPARK-4290
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Xuefu Zhang
>
> MapReduce allows a client to specify files to be put in the distributed
> cache for a job, and the framework guarantees that the files will be
> available in the local file system of the node where a task of the job
> runs, before the task actually starts. While this might be achieved with
> YARN via hacks, it's not available on other cluster managers. It would be
> nice to have equivalent functionality in Spark.
> It would also complement Spark's broadcast variables, which may not be
> suitable in certain scenarios.