GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/3121
[SPARK-4180] Prevent creation of multiple active SparkContexts
This patch adds error-detection logic to throw an exception when attempting
to create multiple active SparkContexts in the same JVM, since this is
currently unsupported and has been known to cause confusing behavior (see
SPARK-2243 for more details).
**The solution implemented here is only a partial fix.** A complete fix
would have the following properties:
1. Only one SparkContext may ever be under construction at any given time.
2. Once a SparkContext has been successfully constructed, any subsequent
construction attempts should fail until the active SparkContext is stopped.
3. If the SparkContext constructor throws an exception, then all resources
created in the constructor should be cleaned up (SPARK-4194).
4. If a user attempts to create a SparkContext but the creation fails, then
the user should be able to create new SparkContexts.
This PR only provides 1) and 2) (and partial support for 4 in two limited
cases); we should be able to provide all of these properties, but the correct
fix will involve larger changes to SparkContext's construction /
initialization, so we'll target it for a different Spark release.
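To make the intended behavior concrete, here is a minimal sketch of what a user would observe (assuming a `local` master and the error checking described below left enabled); the exact exception type and message aren't specified here, so the catch is deliberately broad:
```scala
import org.apache.spark.{SparkConf, SparkContext}

object MultipleContextsExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("first-context")
    val sc1 = new SparkContext(conf)
    try {
      // With this patch, constructing a second active SparkContext in the
      // same JVM is expected to fail fast with an exception instead of
      // producing the confusing behavior described in SPARK-2243.
      new SparkContext(conf.clone().setAppName("second-context"))
    } catch {
      case e: Exception =>
        println(s"Second SparkContext rejected: ${e.getMessage}")
    } finally {
      sc1.stop()
    }
  }
}
```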
### The correct solution:
I think that the correct way to do this would be to move the construction
of SparkContext's dependencies into a static method in the SparkContext
companion object. Specifically, we could make the default SparkContext
constructor `private` and change it to accept a `SparkContextDependencies`
object that contains all of SparkContext's dependencies (e.g. DAGScheduler,
ContextCleaner, etc.). Secondary constructors could call a method on the
SparkContext companion object to create the `SparkContextDependencies` and pass
the result to the primary SparkContext constructor. For example:
```scala
class SparkContext private (deps: SparkContextDependencies) {
  def this(conf: SparkConf) {
    this(SparkContext.getDeps(conf))
  }
}

object SparkContext {
  private[spark] def getDeps(conf: SparkConf): SparkContextDependencies = synchronized {
    if (anotherSparkContextIsActive) { throw new Exception(...) }
    var dagScheduler: DAGScheduler = null
    try {
      dagScheduler = new DAGScheduler(...)
      [...]
    } catch {
      case e: Exception =>
        Option(dagScheduler).foreach(_.stop())
        [...]
    }
    SparkContextDependencies(dagScheduler, ...)
  }
}
```
This gives us mutual exclusion and ensures that any resources created
during the failed SparkContext initialization are properly cleaned up.
This indirection is necessary to maintain binary compatibility. In
retrospect, it would have been nice if SparkContext had no public constructors
and could only be created through builder / factory methods on its companion
object, since this buys us lots of flexibility and makes dependency injection
easier.
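As a standalone illustration of that factory-only style (using hypothetical `MyContext` / `MyContextDeps` names, not Spark classes), all construction funnels through a single companion-object method where locking and cleanup-on-failure can live:
```scala
// Hypothetical classes used only to illustrate the factory-method style.
final case class MyContextDeps(settings: Map[String, String]) {
  def close(): Unit = () // release any resources held by the dependencies
}

class MyContext private (deps: MyContextDeps) {
  def stop(): Unit = deps.close()
}

object MyContext {
  // The only way to obtain a MyContext: mutual exclusion and
  // cleanup-on-failure are handled in exactly one place.
  def create(settings: Map[String, String]): MyContext = synchronized {
    val deps = MyContextDeps(settings) // may throw; nothing half-built escapes
    new MyContext(deps)
  }
}
```
Callers would then write `MyContext.create(...)` instead of `new MyContext(...)`, which also makes it straightforward to inject alternative dependencies in tests.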
### Alternative solutions:
As an alternative solution, we could refactor SparkContext's primary
constructor to perform all object creation in a giant `try-finally` block.
Unfortunately, this would require us to turn a bunch of `vals` into `vars` so
that they can be assigned from the `try` block. If we still want `vals`, we
could wrap each `val` in its own `try` block (since a try block can return a
value), but this would lead to extremely messy code and wouldn't guard against
the introduction of future code that doesn't properly handle failures.
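For illustration, a minimal sketch of that `val`-per-`try` variant (with hypothetical `DepA` / `DepB` classes rather than Spark's real dependencies) shows how quickly the boilerplate piles up:
```scala
// Hypothetical dependencies, used only to illustrate the pattern.
class DepA { def stop(): Unit = () }
class DepB(a: DepA) { def stop(): Unit = () }

class MessyContext {
  // Each val keeps its immutability by wrapping its initializer in a try,
  // but every field must repeat the cleanup-on-failure logic for everything
  // constructed before it.
  val depA: DepA = new DepA()

  val depB: DepB =
    try {
      new DepB(depA)
    } catch {
      case e: Exception =>
        depA.stop() // undo the earlier construction before re-throwing
        throw e
    }
}
```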
The more complex approach outlined above gives us some nice dependency
injection benefits, so I think that might be preferable to a `var`-ification.
### The limitations of this PR's solution:
This PR ensures that _an error_ will occur if multiple SparkContexts are
active at the same time, even if multiple threads race to construct those
contexts. The key problem is that we don't hold the
`SparkContext.SPARK_CONTEXT_CONSTRUCTOR_LOCK` for the entire duration of the
constructor, so we don't have a way to distinguish between a SparkContext
creation attempt that failed because another SparkContext was undergoing
construction and a case where a previous construction attempt failed and left
some state that wasn't cleaned up. This means that it's possible for
SparkContext's constructor to throw an exception in a way that prevents any
future creations of SparkContext until the JVM is restarted.
This is unlikely to be a problem in practice, though, especially since I've
added special-casing so that this issue doesn't affect the most common causes
of SparkContext construction errors (e.g. forgetting to set the master URL or
application name). In case it does become a problem, I've added an undocumented
`spark.driver.disableMultipleSparkContextsErrorChecking` option that turns
this error into a warning.
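A minimal sketch of the check-then-release shape described above (the names `constructorLock` and `contextUnderConstruction` are simplifications for illustration, not the exact fields in this patch):
```scala
object ContextGuard {
  private val constructorLock = new Object()
  private var contextUnderConstruction = false

  // Called at the top of the constructor. The lock is held only for this
  // check, not for the whole constructor body, so a constructor that throws
  // before reaching markStopped() leaves the flag set.
  def markConstructionStarted(): Unit = constructorLock.synchronized {
    if (contextUnderConstruction) {
      // We cannot tell "another thread is mid-construction" apart from
      // "a previous attempt failed and never cleaned up its state".
      throw new IllegalStateException(
        "Another SparkContext is (or appears to be) under construction")
    }
    contextUnderConstruction = true
  }

  // Called from stop(); if construction fails before stop() can run, future
  // creation attempts keep failing until the JVM is restarted.
  def markStopped(): Unit = constructorLock.synchronized {
    contextUnderConstruction = false
  }
}
```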
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark SPARK-4180
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3121.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3121
----
commit afaa7e37cce76da8173ff556d217a67a1633426a
Author: Josh Rosen <[email protected]>
Date: 2014-11-05T23:06:13Z
[SPARK-4180] Prevent creations of multiple active SparkContexts.
----