[ https://issues.apache.org/jira/browse/SPARK-7410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-7410:
-----------------------------
Target Version/s: (was: 1.4.1)
> Add option to avoid broadcasting configuration with newAPIHadoopFile
> --------------------------------------------------------------------
>
> Key: SPARK-7410
> URL: https://issues.apache.org/jira/browse/SPARK-7410
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.4.0
> Reporter: Sandy Ryza
>
> I'm working with a Spark application that creates thousands of HadoopRDDs and
> unions them together. Certain details of the way the data is stored require
> this.
> Creating ten thousand of these RDDs takes about 10 minutes, even before any
> of them is used in an action. I dug into why this takes so long and it looks
> like the overhead of broadcasting the Hadoop configuration is taking up most
> of the time. In this case, the broadcasting isn't helpful because each
> HadoopRDD only corresponds to one or two tasks. When I reverted the original
> change that switched to broadcasting configurations, the time it took to
> instantiate these RDDs improved 10x.
> It would be nice if there were a way to turn this broadcasting off, either
> through a Spark configuration option, a Hadoop configuration option, or an
> argument to hadoopFile / newAPIHadoopFile.
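
For a rough sense of scale, the figures quoted above can be turned into a per-RDD cost. This is only back-of-envelope arithmetic: it assumes "about 10 minutes" and "10x" are exact and that the overhead is spread evenly across the ten thousand RDDs, none of which the report states. The names below are illustrative and not part of Spark.

```python
# Back-of-envelope arithmetic from the figures in the issue description.
# Assumptions (not stated in the report): the quoted numbers are exact and
# the instantiation cost is uniform across RDDs.

NUM_RDDS = 10_000                      # "ten thousand of these RDDs"
TOTAL_SECONDS_WITH_BROADCAST = 10 * 60  # "about 10 minutes"
SPEEDUP_WITHOUT_BROADCAST = 10          # "improved 10x"


def per_rdd_ms(total_seconds, num_rdds):
    """Average instantiation cost per RDD, in milliseconds."""
    return total_seconds * 1000 / num_rdds


# Average cost per RDD while the Hadoop configuration is broadcast.
with_broadcast_ms = per_rdd_ms(TOTAL_SECONDS_WITH_BROADCAST, NUM_RDDS)

# Total and per-RDD cost after reverting the broadcasting change.
total_seconds_without = TOTAL_SECONDS_WITH_BROADCAST / SPEEDUP_WITHOUT_BROADCAST
without_broadcast_ms = per_rdd_ms(total_seconds_without, NUM_RDDS)

print(f"with broadcast:    {with_broadcast_ms:.0f} ms per RDD")
print(f"without broadcast: {without_broadcast_ms:.0f} ms per RDD "
      f"({total_seconds_without:.0f} s total)")
```

Under those assumptions the broadcast path costs roughly 60 ms per RDD, against roughly 6 ms (about one minute in total) without it. That is small for a handful of RDDs but dominates when thousands are created up front.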