JoshRosen commented on a change in pull request #24714: [SPARK-27846][CORE] 
Eagerly compute Configuration.properties in sc.hadoopConfiguration
URL: https://github.com/apache/spark/pull/24714#discussion_r287650756
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/SparkContext.scala
 ##########
 @@ -285,7 +285,18 @@ class SparkContext(config: SparkConf) extends Logging {
    * @note As it will be reused in all Hadoop RDDs, it's better not to modify 
it unless you
    * plan to set some global configurations for all Hadoop RDDs.
    */
-  def hadoopConfiguration: Configuration = _hadoopConfiguration
+  def hadoopConfiguration: Configuration = {
+    // Performance optimization: this dummy call to .size() triggers eager 
evaluation of
+    // Configuration's internal  `properties` field, guaranteeing that it will 
be computed and
+    // cached before SessionState.newHadoopConf() uses 
`sc.hadoopConfiguration` to create
+    // a new per-session Configuration. If `properties` has not been computed 
by that time
+    // then each newly-created Configuration will perform its own expensive IO 
and XML
+    // parsing to load configuration defaults and populate its own properties. 
By ensuring
+    // that we've pre-computed the parent's properties, the child 
Configuration will simply
+    // clone the parent's properties.
+    _hadoopConfiguration.size()
 
 Review comment:
   It's a little hard to get a precise end-to-end performance measurement here, 
unfortunately. In steady state this is't huge, but fixing it cleans up a ton of 
noise in Java profiler output: after this change I no longer see a bunch of 
frames from `org.apache.xerces` / `ClassLoader.getResource`, etc.
   
   I submitted this mostly for the sake upstreaming changes from my local 
fuzz-testing branch (which runs tons of short queries through the 
ThriftServer).  

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to