[ https://issues.apache.org/jira/browse/SPARK-4290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201496#comment-14201496 ]
Xuefu Zhang commented on SPARK-4290:
------------------------------------
Hi [~rxin], by "out of box", do you mean
org.apache.hadoop.filecache.DistributedCache [1]? This is a MapReduce client
class, used when you submit an MR job. It basically tells the MR framework
that your job needs these files placed in the distributed cache in order to
run. The MR framework then copies the files to the local file system of each
node where a task of the job runs, and the tasks access the local copies via
symlinks.
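To make the mechanism concrete, here is a minimal sketch of the client-side
call, assuming a plain MR2 job; the file path and symlink name are made up
for illustration. DistributedCache [1] is the older client-side entry point;
in the current API the same settings go through Job#addCacheFile.
{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cache-file-example");

    // Ask the framework to localize this HDFS file on every node that runs
    // a task of this job; the "#lookup" fragment requests a symlink named
    // "lookup" in the task's working directory.
    job.addCacheFile(new URI("hdfs:///tmp/lookup.dat#lookup"));

    // ... set mapper/reducer, input/output paths, and submit as usual ...
    // Inside a task, the file can then be opened as a plain local file,
    // e.g. new java.io.File("lookup").
  }
}
{code}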
I don't know how this can be used out of the box. First, a Hive on Spark user
may not have the MR client library. Secondly, there is no MR framework in
place to do the copying.
Do you have an example of how I might achieve this?
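For what it's worth, the closest thing I can see on the Spark side is
SparkContext#addFile() on the driver plus SparkFiles.get() in the task, but
I'm not sure whether that is what you had in mind. A rough sketch, with a
made-up file name:
{code:java}
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public class AddFileExample {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("add-file-example"));

    // Driver side: register the file; Spark ships it to every executor node.
    sc.addFile("hdfs:///tmp/lookup.dat");

    sc.parallelize(Arrays.asList(1, 2, 3)).foreach(x -> {
      // Executor side: resolve the node-local copy and read it like any
      // ordinary local file.
      String localPath = SparkFiles.get("lookup.dat");
      System.out.println(localPath);
    });

    sc.stop();
  }
}
{code}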
[1]
http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/filecache/DistributedCache.html
> Provide an equivalent functionality of distributed cache as MR does
> -------------------------------------------------------------------
>
> Key: SPARK-4290
> URL: https://issues.apache.org/jira/browse/SPARK-4290
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Xuefu Zhang
>
> MapReduce allows a client to specify files to be put in the distributed
> cache for a job, and the framework guarantees that the files will be
> available in the local file system of the node where a task of the job
> runs, before the task actually starts. While this might be achieved with
> YARN via hacks, it's not available on other cluster managers. It would be
> nice to have equivalent functionality in Spark.
> It would also complement Spark's broadcast variables, which may not be
> suitable in certain scenarios.