Sandy Ryza created SPARK-7410:
---------------------------------
Summary: Add option to avoid broadcasting configuration with newAPIHadoopFile
Key: SPARK-7410
URL: https://issues.apache.org/jira/browse/SPARK-7410
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 1.4.0
Reporter: Sandy Ryza

I'm working with a Spark application that creates thousands of HadoopRDDs and
unions them together. Certain details of the way the data is stored require
this.
Creating ten thousand of these RDDs takes about 10 minutes, even before any of
them is used in an action. I dug into why this takes so long, and it looks like
the overhead of broadcasting the Hadoop configuration accounts for most of the
time. In this case, the broadcasting isn't helpful because each HadoopRDD only
corresponds to one or two tasks. When I reverted the original change that
switched to broadcasting configurations, instantiating these RDDs became 10x
faster.
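A minimal sketch of the pattern described above, for anyone who wants to
reproduce the measurement. The input paths (hdfs:///data/part-N), the text
input format, and the key/value types are hypothetical placeholders, not the
ones from the actual application. It times only RDD construction and the
union, before any action runs.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnionManyHadoopRDDs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("union-many-hadoop-rdds"))

    // Hypothetical layout: one small input path per RDD.
    val paths: Seq[String] = (0 until 10000).map(i => s"hdfs:///data/part-$i")

    val start = System.nanoTime()

    // Each call constructs a separate RDD on the driver; no job is launched here.
    val rdds: Seq[RDD[(LongWritable, Text)]] = paths.map { path =>
      sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat](path)
    }

    // Combine all of the per-path RDDs into a single RDD.
    val all: RDD[(LongWritable, Text)] = sc.union(rdds)

    val elapsedMs = (System.nanoTime() - start) / 1000000
    println(s"Built ${rdds.size} RDDs and their union in $elapsedMs ms")

    sc.stop()
  }
}
```

Because no action is called on the unioned RDD, the elapsed time reflects
driver-side setup cost only, which is where the slowdown described here shows
up.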
It would be nice if there were a way to turn this broadcasting off, whether
through a Spark configuration option, a Hadoop configuration option, or an
argument to hadoopFile / newAPIHadoopFile.