[
https://issues.apache.org/jira/browse/SPARK-33120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17212670#comment-17212670
]
Dongjoon Hyun edited comment on SPARK-33120 at 10/12/20, 9:00 PM:
------------------------------------------------------------------
Hi, [~tsmock]. What is the benefit you need here?
bq. I would like to avoid copying all of the files to every executor until it
is actually needed.
> Lazy Load of SparkContext.addFiles
> ----------------------------------
>
> Key: SPARK-33120
> URL: https://issues.apache.org/jira/browse/SPARK-33120
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.1
> Environment: Mac OS X (2 systems), workload to eventually be run on
> Amazon EMR.
> Java 11 application.
> Reporter: Taylor Smock
> Priority: Minor
>
> In my Spark job, I have various files that may or may not be used by each
> task.
> I would like to avoid copying every file to every executor until it is
> actually needed.
>
> What I've tried:
> * SparkContext.addFile with SparkFiles.get. In testing, all files were
> distributed to all executors.
> * Broadcast variables. Since I _don't_ know which files I will need until
> the task has started, I have to broadcast all of the data at once, so every
> node receives the data and then caches it to disk. In short, this has the
> same problem as SparkContext.addFile, with the added benefit that I can
> create a mapping of paths to files.
>
> What I would like to see:
> * SparkContext.addFiles(file, Enum.LazyLoad) w/ SparkFiles.get(file,
> Enum.WaitForAvailability) or Future<?> future = SparkFiles.get(file)
>
> Notes:
> https://issues.apache.org/jira/browse/SPARK-4290?focusedCommentId=14205346&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14205346
> indicated that `SparkFiles.get` would be required to get the data on the
> local driver, but in my testing that did not appear to be the case.
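
For illustration, the lazy semantics requested in the issue can be sketched
outside of Spark. The class below is a hypothetical helper, not a Spark API:
LazyFileStore, register, get, getAsync and isFetched are all names invented
for this sketch, and it assumes the source files sit on storage that every
executor can read (for example a shared mount). Registering a file only
records where it lives; the copy happens on the first get, and getAsync
roughly mirrors the Future-returning form suggested above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical helper (not part of Spark): copies a registered file only on first request. */
final class LazyFileStore {
    private final Path localDir; // assumed per-executor scratch directory
    private final Map<String, Path> sources = new ConcurrentHashMap<>();
    private final Map<String, Path> fetched = new ConcurrentHashMap<>();

    LazyFileStore(Path localDir) {
        this.localDir = localDir;
    }

    /** Records where a file lives; nothing is copied yet. */
    void register(String name, Path source) {
        sources.put(name, source);
    }

    /** Blocks until the file is available locally, copying it on first access only. */
    Path get(String name) {
        Path source = sources.get(name);
        if (source == null) {
            throw new IllegalArgumentException("Unknown file: " + name);
        }
        // computeIfAbsent runs the copy at most once per name, even under concurrent access.
        return fetched.computeIfAbsent(name, n -> copy(source, localDir.resolve(n)));
    }

    /** Non-blocking variant: the returned future completes once the file is local. */
    CompletableFuture<Path> getAsync(String name) {
        return CompletableFuture.supplyAsync(() -> get(name));
    }

    /** True once the file has been copied to the local directory. */
    boolean isFetched(String name) {
        return fetched.containsKey(name);
    }

    private static Path copy(Path source, Path target) {
        try {
            Files.createDirectories(target.getParent());
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a task, one such store per executor JVM could be populated with register
for every candidate file, with get called only for the files a given task
turns out to need. Doing the copy inside computeIfAbsent keeps the sketch
short, but it holds a map lock during I/O, so a real implementation would
probably want a different synchronization scheme.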