JoshRosen opened a new pull request #24714: [SPARK-27846] Eagerly compute 
Configuration.properties in sc.hadoopConfiguration
URL: https://github.com/apache/spark/pull/24714
 
 
   ## What changes were proposed in this pull request?
   
   Hadoop `Configuration` has an internal `properties` map which is lazily 
initialized. Initialization of this field, done in the private 
`Configuration.getProps()` method, is rather expensive because it ends up 
parsing XML configuration files. When cloning a `Configuration`, this 
`properties` field is cloned if it has been initialized.
   
   In some cases `sc.hadoopConfiguration` never computes this `properties` 
field. Cloning it in `SessionState.newHadoopConf()` is then slow, because each 
cloned `Configuration` has to re-parse the configuration XML files from disk.
   
   To avoid this problem, we can call `Configuration.size()` to trigger a call 
to `getProps()`, ensuring that this expensive computation is cached and re-used 
when cloning configurations.
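
   The effect can be illustrated with a toy model. This is only a sketch: 
`LazyConfDemo`, `LazyConf` and `parsesForClones` are hypothetical names, not 
Hadoop or Spark classes, and the counter merely stands in for the cost of 
parsing XML. It assumes, as described above, that a copy inherits the 
`properties` map only when the source has already built it.

```java
import java.util.HashMap;
import java.util.Map;

class LazyConfDemo {

    // Toy stand-in for Hadoop's Configuration (hypothetical, NOT the real class).
    static class LazyConf {
        // Counts simulated "XML parses" across all instances.
        static int parseCount = 0;

        // Stays null until the first read, mirroring the lazy `properties` field.
        private Map<String, String> properties;

        LazyConf() {}

        // Copy constructor: copies the map only if the source already built it.
        LazyConf(LazyConf other) {
            if (other.properties != null) {
                this.properties = new HashMap<>(other.properties);
            }
        }

        private Map<String, String> getProps() {
            if (properties == null) {
                parseCount++; // stands in for the expensive XML parsing
                properties = new HashMap<>();
                properties.put("fs.defaultFS", "file:///"); // illustrative value
            }
            return properties;
        }

        int size() {
            return getProps().size();
        }

        String get(String key) {
            return getProps().get(key);
        }
    }

    // Returns the number of simulated parses needed to create and read `clones` copies.
    static int parsesForClones(boolean warmUp, int clones) {
        LazyConf.parseCount = 0;
        LazyConf base = new LazyConf();
        if (warmUp) {
            base.size(); // eager initialization, analogous to the proposed fix
        }
        for (int i = 0; i < clones; i++) {
            new LazyConf(base).get("fs.defaultFS");
        }
        return LazyConf.parseCount;
    }

    public static void main(String[] args) {
        System.out.println("cold base: " + parsesForClones(false, 5) + " parses");
        System.out.println("warm base: " + parsesForClones(true, 5) + " parses");
    }
}
```

   In this model a cold base costs one parse per clone, while a warmed base 
costs a single parse in total, whatever the number of clones.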
   
   I discovered this problem when profiling the Spark ThriftServer under a SQL 
fuzzing workload.
   
   ## How was this patch tested?
   
   Examined YourKit profiles before and after my change.

