Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1648#discussion_r15627025
  
    --- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
    @@ -127,26 +110,15 @@ class HadoopRDD[K, V](
       private val createTime = new Date()
     
       // Returns a JobConf that will be used on slaves to obtain input splits for Hadoop reads.
    -  protected def getJobConf(): JobConf = {
    -    val conf: Configuration = broadcastedConf.value.value
    -    if (conf.isInstanceOf[JobConf]) {
    -      // A user-broadcasted JobConf was provided to the HadoopRDD, so always use it.
    -      conf.asInstanceOf[JobConf]
    -    } else if (HadoopRDD.containsCachedMetadata(jobConfCacheKey)) {
    -      // getJobConf() has been called previously, so there is already a local cache of the JobConf
    -      // needed by this RDD.
    -      HadoopRDD.getCachedMetadata(jobConfCacheKey).asInstanceOf[JobConf]
    -    } else {
    -      // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in the
    -      // local process. The local cache is accessed through HadoopRDD.putCachedMetadata().
    -      // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary objects.
    -      // Synchronize to prevent ConcurrentModificationException (Spark-1097, Hadoop-10456).
    -      HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
    -        val newJobConf = new JobConf(conf)
    -        initLocalJobConfFuncOpt.map(f => f(newJobConf))
    -        HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
    -        newJobConf
    -      }
    +  protected def createJobConf(): JobConf = {
    +    val conf: Configuration = serializableConf.value
    --- End diff --
    
    That is because RDD objects are not reused at all: each task gets its own
    deserialized copy of the HadoopRDD and of the conf.
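
    To illustrate the point about per-task copies, here is a minimal sketch. It
    assumes plain Java serialization as a stand-in for how a task is shipped, and
    `FakeRdd` and `TaskCopyDemo` are hypothetical names invented for this example;
    they are not Spark classes. Deserializing the same bytes twice yields two
    unrelated objects, so state held by one copy is not visible to the other.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for an RDD that carries a configuration (not Spark's HadoopRDD).
class FakeRdd(val conf: java.util.HashMap[String, String]) extends Serializable

object TaskCopyDemo {
  // Serialize an object to bytes, as a driver might before shipping a task.
  def serialize(obj: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.toByteArray
  }

  // Rebuild a fresh object graph from bytes, as each task would on its side.
  def deserialize[T](data: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(data))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val conf = new java.util.HashMap[String, String]()
    conf.put("some.key", "driver-value")
    val bytes = serialize(new FakeRdd(conf))

    // Two simulated tasks, each deserializing its own copy.
    val taskA = deserialize[FakeRdd](bytes)
    val taskB = deserialize[FakeRdd](bytes)

    // A change made through one copy does not reach the other.
    taskA.conf.put("some.key", "changed-by-task-A")
    println(taskA eq taskB)             // false: distinct objects
    println(taskB.conf.get("some.key")) // driver-value
  }
}
```

    Under these assumptions, anything stored on the object itself lives only as
    long as one task's copy does.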

