Merge pull request #20 from harveyfeng/hadoop-config-cache

Allow users to pass broadcasted Configurations and cache InputFormats across 
Hadoop file reads.

Note: originally from https://github.com/mesos/spark/pull/942

Currently motivated by Shark queries on Hive-partitioned tables, where there's 
a JobConf broadcast for every Hive partition (i.e., every subdirectory read). 
The only thing different about those JobConfs is the input path - the Hadoop 
Configuration that the JobConfs are constructed from remains the same.
This PR only modifies the old Hadoop API RDDs, but similar additions to the new 
API might slightly reduce computation latency for high-frequency 
FileInputDStreams (which only use the new API right now).

As a small bonus, added InputFormat caching, to avoid reflection calls on 
every RDD#compute().
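
The caching idea can be sketched as follows. This is a minimal illustration 
and not the code in this PR: InstanceCache, getOrCreate and reflectiveCalls 
are hypothetical names, and the sketch uses only the JDK (no Hadoop classes). 
It assumes the cached class has a public no-argument constructor and that 
sharing one instance across callers is safe.

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper: caches one reflectively created instance per class,
// so repeated lookups skip the reflection call.
object InstanceCache {
  private val cache = new ConcurrentHashMap[Class[_], AnyRef]()

  // Counts how many times reflection was actually used (for illustration).
  val reflectiveCalls = new AtomicInteger(0)

  def getOrCreate[T <: AnyRef](cls: Class[T]): T = {
    val cached = cache.get(cls)
    if (cached != null) {
      // Cache hit: no reflection needed.
      cls.cast(cached)
    } else {
      // Cache miss: instantiate via the public no-argument constructor.
      reflectiveCalls.incrementAndGet()
      val created = cls.getDeclaredConstructor().newInstance()
      // If another thread won the race, keep its instance instead of ours.
      val previous = cache.putIfAbsent(cls, created)
      if (previous != null) cls.cast(previous) else created
    }
  }
}
```

Calling InstanceCache.getOrCreate twice with the same class returns the same 
object, and reflection runs only on the first call (barring a race between 
threads).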

A few other notes:

- Added a general soft-reference hashmap in SparkHadoopUtil because I wanted to 
  avoid adding another class to SparkEnv.
- SparkContext's default hadoopConfiguration isn't cached. There's no equals() 
  method for Configuration, so there isn't a good way to determine when 
  configuration properties have changed.
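
For readers unfamiliar with the term, a soft-reference hashmap could look 
roughly like the sketch below. It is an assumption-laden illustration, not 
the map added to SparkHadoopUtil: SoftValueCache is a hypothetical class, 
built only on java.lang.ref.SoftReference and ConcurrentHashMap. Values are 
held softly, so the JVM may reclaim them under memory pressure, and a lookup 
must treat a cleared entry as a miss. Because Configuration has no equals(), 
the sketch assumes callers supply their own key (for example a string).

```scala
import java.lang.ref.SoftReference
import java.util.concurrent.ConcurrentHashMap

// Hypothetical cache whose values are softly reachable: the garbage
// collector may clear them when memory is tight.
class SoftValueCache[K <: AnyRef, V <: AnyRef] {
  private val entries = new ConcurrentHashMap[K, SoftReference[V]]()

  def put(key: K, value: V): Unit = {
    entries.put(key, new SoftReference[V](value))
  }

  def get(key: K): Option[V] = {
    val ref = entries.get(key)
    if (ref == null) {
      None // key was never stored (or its entry was already removed)
    } else {
      val value = ref.get()
      if (value == null) {
        // The referent was collected; drop the stale entry, but only if
        // it is still the same reference we looked at.
        entries.remove(key, ref)
        None
      } else {
        Some(value)
      }
    }
  }
}
```

A caller would fall back to rebuilding the value whenever get returns None, 
since an entry can disappear at any garbage collection.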


Project: http://git-wip-us.apache.org/repos/asf/incubator-spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-spark/commit/4a25b116
Tree: http://git-wip-us.apache.org/repos/asf/incubator-spark/tree/4a25b116
Diff: http://git-wip-us.apache.org/repos/asf/incubator-spark/diff/4a25b116

Branch: refs/heads/master
Commit: 4a25b116d4e451afdf10fc4f018c383ed2c7789a
Parents: 8fc68d0 6a2bbec
Author: Matei Zaharia <ma...@eecs.berkeley.edu>
Authored: Sat Oct 5 19:28:55 2013 -0700
Committer: Matei Zaharia <ma...@eecs.berkeley.edu>
Committed: Sat Oct 5 19:28:55 2013 -0700

----------------------------------------------------------------------
 .../scala/org/apache/spark/CacheManager.scala   |   4 +-
 .../scala/org/apache/spark/SparkContext.scala   |  39 ++++--
 .../apache/spark/deploy/SparkHadoopUtil.scala   |  12 +-
 .../scala/org/apache/spark/rdd/HadoopRDD.scala  | 140 ++++++++++++++++---
 4 files changed, 161 insertions(+), 34 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-spark/blob/4a25b116/core/src/main/scala/org/apache/spark/SparkContext.scala
----------------------------------------------------------------------
