Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/3543#discussion_r21658128
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -262,7 +263,7 @@ class SQLContext(@transient val sparkContext: SparkContext)
   def createParquetFile[A <: Product : TypeTag](
       path: String,
       allowExisting: Boolean = true,
-      conf: Configuration = new Configuration()): SchemaRDD = {
--- End diff --
If we're going to use `CONFIGURATION_INSTANTIATION_LOCK` in multiple
places, then I think it makes sense to move it into `SparkHadoopUtil`, which
seems like a more logical home for it than `HadoopRDD`. I like the idea of
hiding the synchronization logic behind a method like
`SparkHadoopUtil.newConfiguration`.
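Roughly something like this (a minimal sketch, not the final API; it assumes the lock keeps the same semantics it has in `HadoopRDD` today, and everything other than the lock name and the `newConfiguration` idea is illustrative):
```scala
import org.apache.hadoop.conf.Configuration

object SparkHadoopUtil {
  // Hadoop's Configuration constructor is not thread-safe, so guard every
  // instantiation with a single JVM-wide lock (this currently lives in HadoopRDD).
  private val CONFIGURATION_INSTANTIATION_LOCK = new Object()

  // Callers get a fresh Configuration without having to know about the lock.
  def newConfiguration(): Configuration = CONFIGURATION_INSTANTIATION_LOCK.synchronized {
    new Configuration()
  }
}
```
That way `HadoopRDD`, `SQLContext`, and anything else that needs a fresh `Configuration` can share one synchronized entry point.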
Regarding whether `SparkContext.hadoopConfiguration` will lead to
thread-safety issues: I did a bit of research on this while developing a
workaround for the other configuration thread-safety issues and wrote [a series
of
comments](https://issues.apache.org/jira/browse/SPARK-2546?focusedCommentId=14160790&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14160790)
citing cases of code "in the wild" that depend on mutating
`SparkContext.hadoopConfiguration`. For example, there are a lot of snippets
of code that look like this:
```scala
// mutate the shared, context-wide Hadoop configuration, then write through it
sc.hadoopConfiguration.set("es.resource", "syslog/entry")
output.saveAsHadoopFile[ESOutputFormat]("-")
```
In Spark 1.x, I don't think we'll be able to safely transition away from
using the shared `SparkContext.hadoopConfiguration` instance since there's so
much existing code that relies on the current behavior.
However, I think there's much less risk of running into thread-safety
issues from this particular use. It seems fairly unlikely that multiple
threads will be mutating the shared configuration in the driver JVM. In executor
JVMs, most Hadoop `InputFormats` (and other classes) don't mutate configurations,
so we shouldn't run into issues; for those that do, users can always enable
the `cloneConf` setting.
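For reference, that escape hatch is just a SparkConf flag; here's a minimal sketch, assuming the `spark.hadoop.cloneConf` key that `HadoopRDD` checks (off by default, since cloning per task adds some overhead):
```scala
import org.apache.spark.{SparkConf, SparkContext}

// Ask HadoopRDD to clone the Hadoop Configuration for each task instead of
// sharing one mutable instance across tasks.
val conf = new SparkConf()
  .setAppName("clone-conf-example")
  .setMaster("local[*]")
  .set("spark.hadoop.cloneConf", "true")
val sc = new SparkContext(conf)
```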
In a nutshell, the shared `sc.hadoopConfiguration` isn't a design we would
choose if we were starting from scratch, but using it here is consistent with
the behavior we have elsewhere in Spark, and we're stuck with it for 1.x
anyway.