Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/3543#discussion_r21658128
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -262,7 +263,7 @@ class SQLContext(@transient val sparkContext: SparkContext)
   def createParquetFile[A <: Product : TypeTag](
       path: String,
       allowExisting: Boolean = true,
-      conf: Configuration = new Configuration()): SchemaRDD = {
--- End diff --
If we're going to use `CONFIGURATION_INSTANTIATION_LOCK` in multiple
places, then I think it makes sense to move it into `SparkHadoopUtil`, which
seems like a more logical home for it than `HadoopRDD`. I like the idea of
hiding the synchronization logic behind a method like
`SparkHadoopUtil.newConfiguration`.
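Roughly something like this (a minimal sketch, not the final API; it assumes the lock keeps the same semantics it has in `HadoopRDD` today, and everything other than the lock name and the `newConfiguration` idea is illustrative):
```scala
import org.apache.hadoop.conf.Configuration

object SparkHadoopUtil {
  // Hadoop's Configuration constructor is not thread-safe, so guard every
  // instantiation with a single JVM-wide lock (this currently lives in HadoopRDD).
  private val CONFIGURATION_INSTANTIATION_LOCK = new Object()

  // Callers get a fresh Configuration without having to know about the lock.
  def newConfiguration(): Configuration = CONFIGURATION_INSTANTIATION_LOCK.synchronized {
    new Configuration()
  }
}
```
That way `HadoopRDD`, `SQLContext`, and anything else that needs a fresh `Configuration` can share one synchronized entry point.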
Regarding whether `SparkContext.hadoopConfiguration` will lead to
thread-safety issues: I did a bit of research on this while developing a
workaround for the other configuration thread-safety issues and wrote [a series
of
comments](https://issues.apache.org/jira/browse/SPARK-2546?focusedCommentId=14160790&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14160790)
citing cases of code "in the wild" that depend on mutating
`SparkContext.hadoopConfiguration`. For example, there are a lot of snippets
of code that look like this:
```scala
// mutate the shared, context-wide Hadoop configuration, then write through it
sc.hadoopConfiguration.set("es.resource", "syslog/entry")
output.saveAsHadoopFile[ESOutputFormat]("-")
```
In Spark 1.x, I don't think we'll be able to safely transition away from
using the shared `SparkContext.hadoopConfiguration` instance since there's so
much existing code that relies on the current behavior.
However, I think there's much less risk of running into thread-safety
issues from this particular use. It seems fairly unlikely that multiple
threads will be mutating the shared configuration in the driver JVM. In executor
JVMs, most Hadoop `InputFormats` (and other classes) don't mutate configurations,
so we shouldn't run into issues; for those that do, users can always enable
the `cloneConf` setting.
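For reference, that escape hatch is just a SparkConf flag; here's a minimal sketch, assuming the `spark.hadoop.cloneConf` key that `HadoopRDD` checks (off by default, since cloning per task adds some overhead):
```scala
import org.apache.spark.{SparkConf, SparkContext}

// Ask HadoopRDD to clone the Hadoop Configuration for each task instead of
// sharing one mutable instance across tasks.
val conf = new SparkConf()
  .setAppName("clone-conf-example")
  .setMaster("local[*]")
  .set("spark.hadoop.cloneConf", "true")
val sc = new SparkContext(conf)
```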
In a nutshell, the shared `sc.hadoopConfiguration` isn't a design we would
choose if we were starting from scratch, but using it here is consistent with
the behavior we have elsewhere in Spark, and we're stuck with it for 1.x
anyway.