Github user aarondav commented on a diff in the pull request:
https://github.com/apache/spark/pull/1648#discussion_r15626994
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -127,26 +110,15 @@ class HadoopRDD[K, V](
private val createTime = new Date()
// Returns a JobConf that will be used on slaves to obtain input splits
for Hadoop reads.
- protected def getJobConf(): JobConf = {
- val conf: Configuration = broadcastedConf.value.value
- if (conf.isInstanceOf[JobConf]) {
- // A user-broadcasted JobConf was provided to the HadoopRDD, so
always use it.
- conf.asInstanceOf[JobConf]
- } else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
- // getJobConf() has been called previously, so there is already a
local cache of the JobConf
- // needed by this RDD.
- HadoopRDD.getCachedMetadata(jobConfCacheKey).asInstanceOf[JobConf]
- } else {
- // Create a JobConf that will be cached and used across this RDD's
getJobConf() calls in the
- // local process. The local cache is accessed through
HadoopRDD.putCachedMetadata().
- // The caching helps minimize GC, since a JobConf can contain ~10KB
of temporary objects.
- // Synchronize to prevent ConcurrentModificationException
(Spark-1097, Hadoop-10456).
- HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
- val newJobConf = new JobConf(conf)
- initLocalJobConfFuncOpt.map(f => f(newJobConf))
- HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
- newJobConf
- }
+ protected def createJobConf(): JobConf = {
+ val conf: Configuration = serializableConf.value
--- End diff --
Is this guaranteed to return a new copy of the conf for every partition or
something? Because otherwise I'm not sure I see why we can safely remove the
lock.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---