HyukjinKwon commented on a change in pull request #24714: [SPARK-27846][CORE]
Eagerly compute Configuration.properties in sc.hadoopConfiguration
URL: https://github.com/apache/spark/pull/24714#discussion_r287629486
##########
File path: core/src/main/scala/org/apache/spark/SparkContext.scala
##########
@@ -285,7 +285,18 @@ class SparkContext(config: SparkConf) extends Logging {
* @note As it will be reused in all Hadoop RDDs, it's better not to modify it unless you
* plan to set some global configurations for all Hadoop RDDs.
*/
- def hadoopConfiguration: Configuration = _hadoopConfiguration
+ def hadoopConfiguration: Configuration = {
+   // Performance optimization: this dummy call to .size() triggers eager evaluation of
+   // Configuration's internal `properties` field, guaranteeing that it will be computed and
+   // cached before SessionState.newHadoopConf() uses `sc.hadoopConfiguration` to create
+   // a new per-session Configuration. If `properties` has not been computed by that time
+   // then each newly-created Configuration will perform its own expensive IO and XML
+   // parsing to load configuration defaults and populate its own properties. By ensuring
+   // that we've pre-computed the parent's properties, the child Configuration will simply
+   // clone the parent's properties.
+   _hadoopConfiguration.size()
Review comment:
@JoshRosen, out of curiosity, how much does this improve performance? A rough estimate is fine. Since this doesn't look like a critical path, I'd lean toward prioritizing readability if the gain is rather minor.
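For context, here is a minimal standalone sketch of the pattern the comment in the diff describes, using plain Hadoop `Configuration` outside of Spark; the object name and the config key are hypothetical and this is not the actual SparkContext/SessionState code:

```scala
import org.apache.hadoop.conf.Configuration

object EagerHadoopConfSketch {
  def main(args: Array[String]): Unit = {
    val parent = new Configuration()

    // Calling size() forces the lazily-loaded `properties` field to be
    // populated (reading and parsing core-default.xml, core-site.xml, etc.).
    parent.size()

    // Copies made afterwards (e.g. per-session configs) start from the
    // parent's already-computed properties instead of each doing their own
    // IO and XML parsing of the defaults.
    val perSession = new Configuration(parent)
    perSession.set("hypothetical.session.key", "value") // illustrative key only
  }
}
```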