JoshRosen commented on a change in pull request #24714: [SPARK-27846][CORE]
Eagerly compute Configuration.properties in sc.hadoopConfiguration
URL: https://github.com/apache/spark/pull/24714#discussion_r287650756
##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -285,7 +285,18 @@ class SparkContext(config: SparkConf) extends Logging {
* @note As it will be reused in all Hadoop RDDs, it's better not to modify it unless you
* plan to set some global configurations for all Hadoop RDDs.
*/
- def hadoopConfiguration: Configuration = _hadoopConfiguration
+ def hadoopConfiguration: Configuration = {
+ // Performance optimization: this dummy call to .size() triggers eager evaluation of
+ // Configuration's internal `properties` field, guaranteeing that it will be computed and
+ // cached before SessionState.newHadoopConf() uses `sc.hadoopConfiguration` to create
+ // a new per-session Configuration. If `properties` has not been computed by that time
+ // then each newly-created Configuration will perform its own expensive IO and XML
+ // parsing to load configuration defaults and populate its own properties. By ensuring
+ // that we've pre-computed the parent's properties, the child Configuration will simply
+ // clone the parent's properties.
+ _hadoopConfiguration.size()
Review comment:
It's a little hard to get a precise end-to-end performance measurement here,
unfortunately. In steady state the gain isn't huge, but fixing it cleans up a ton of
noise in Java profiler output: after this change I no longer see a bunch of
frames from `org.apache.xerces`, `ClassLoader.getResource`, etc.
I submitted this mostly for the sake of upstreaming changes from my local
fuzz-testing branch (which runs tons of short queries through the
ThriftServer).
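
To illustrate the effect described in the code comment, here is a minimal, self-contained sketch. It does not use Hadoop at all: `ToyConf`, `expensiveLoad`, and `loads` are hypothetical names invented for this example, and the class only imitates the behavior the comment attributes to `Configuration` (lazy loading of `properties`, and cloning from a parent only when the parent has already loaded them).

```scala
import java.util.Properties

object ToyConf {
  // Counts how many times the simulated expensive load has run.
  var loads: Int = 0

  // Hypothetical stand-in for the IO and XML parsing of configuration defaults.
  private def expensiveLoad(): Properties = {
    loads += 1
    val p = new Properties()
    p.setProperty("fs.defaultFS", "file:///")
    p
  }
}

// Toy model only; not the real org.apache.hadoop.conf.Configuration.
final class ToyConf(parent: Option[ToyConf] = None) {
  // Cloned from the parent only if the parent has already loaded its properties;
  // otherwise left null so that this instance must load its own copy later.
  private var props: Properties =
    parent.flatMap(p => Option(p.props))
      .map(_.clone().asInstanceOf[Properties])
      .orNull

  private def getProps: Properties = {
    if (props == null) props = ToyConf.expensiveLoad()
    props
  }

  // Like the dummy .size() call in the diff, this forces the lazy load.
  def size(): Int = getProps.size()
}

object Demo {
  def main(args: Array[String]): Unit = {
    // Parent never warmed up: every child pays for its own load.
    val lazyParent = new ToyConf()
    (1 to 3).foreach(_ => new ToyConf(Some(lazyParent)).size())
    println(s"loads without warm-up: ${ToyConf.loads}") // prints 3

    // Parent warmed up first: children clone and never load.
    ToyConf.loads = 0
    val warmParent = new ToyConf()
    warmParent.size()
    (1 to 3).foreach(_ => new ToyConf(Some(warmParent)).size())
    println(s"loads with warm-up: ${ToyConf.loads}") // prints 1
  }
}
```

In this toy model the cost without a warm-up grows with the number of child configurations, while a single eager `size()` call on the parent keeps it at one load. That matches the shape of the workload mentioned above (many short queries, each creating a per-session configuration), though the real savings depend on Hadoop's actual loading cost.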
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services