GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/3121
[SPARK-4180] Prevent creation of multiple active SparkContexts
This patch adds error-detection logic to throw an exception when attempting
to create multiple active SparkContexts in the same JVM, since this is
currently unsupported and has been known to cause confusing behavior (see
SPARK-2243 for more details).
**The solution implemented here is only a partial fix.** A complete fix
would have the following properties:
1. Only one SparkContext may ever be under construction at any given time.
2. Once a SparkContext has been successfully constructed, any subsequent
construction attempts should fail until the active SparkContext is stopped.
3. If the SparkContext constructor throws an exception, then all resources
created in the constructor should be cleaned up (SPARK-4194).
4. If a user attempts to create a SparkContext but the creation fails, then
the user should be able to create new SparkContexts.
This PR only provides 1) and 2) (and partial support for 4 in two limited
cases); we should be able to provide all of these properties, but the correct
fix will involve larger changes to SparkContext's construction /
initialization, so we'll target it for a different Spark release.
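To make the intended behavior concrete, here is a minimal sketch of what a user would observe (assuming a `local` master and the error checking described below left enabled); the exact exception type and message aren't specified here, so the catch is deliberately broad:
```scala
import org.apache.spark.{SparkConf, SparkContext}

object MultipleContextsExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("first-context")
    val sc1 = new SparkContext(conf)
    try {
      // With this patch, constructing a second active SparkContext in the
      // same JVM is expected to fail fast with an exception instead of
      // producing the confusing behavior described in SPARK-2243.
      new SparkContext(conf.clone().setAppName("second-context"))
    } catch {
      case e: Exception =>
        println(s"Second SparkContext rejected: ${e.getMessage}")
    } finally {
      sc1.stop()
    }
  }
}
```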
### The correct solution:
I think that the correct way to do this would be to move the construction
of SparkContext's dependencies into a static method in the SparkContext
companion object. Specifically, we could make the default SparkContext
constructor `private` and change it to accept a `SparkContextDependencies`
object that contains all of SparkContext's dependencies (e.g. DAGScheduler,
ContextCleaner, etc.). Secondary constructors could call a method on the
SparkContext companion object to create the `SparkContextDependencies` and pass
the result to the primary SparkContext constructor. For example:
```scala
class SparkContext private (deps: SparkContextDependencies) {
  def this(conf: SparkConf) {
    this(SparkContext.getDeps(conf))
  }
}

object SparkContext {
  private[spark] def getDeps(conf: SparkConf): SparkContextDependencies = synchronized {
    if (anotherSparkContextIsActive) { throw new Exception(...) }
    var dagScheduler: DAGScheduler = null
    try {
      dagScheduler = new DAGScheduler(...)
      [...]
    } catch {
      case e: Exception =>
        Option(dagScheduler).foreach(_.stop())
        [...]
    }
    SparkContextDependencies(dagScheduler, ...)
  }
}
```
This gives us mutual exclusion and ensures that any resources created
during the failed SparkContext initialization are properly cleaned up.
This indirection is necessary to maintain binary compatibility. In
retrospect, it would have been nice if SparkContext had no public constructors
and could only be created through builder / factory methods on its companion
object, since this buys us lots of flexibility and makes dependency injection
easier.
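As a standalone illustration of that factory-only style (using hypothetical `MyContext` / `MyContextDeps` names, not Spark classes), all construction funnels through a single companion-object method where locking and cleanup-on-failure can live:
```scala
// Hypothetical classes used only to illustrate the factory-method style.
final case class MyContextDeps(settings: Map[String, String]) {
  def close(): Unit = () // release any resources held by the dependencies
}

class MyContext private (deps: MyContextDeps) {
  def stop(): Unit = deps.close()
}

object MyContext {
  // The only way to obtain a MyContext: mutual exclusion and
  // cleanup-on-failure are handled in exactly one place.
  def create(settings: Map[String, String]): MyContext = synchronized {
    val deps = MyContextDeps(settings) // may throw; nothing half-built escapes
    new MyContext(deps)
  }
}
```
Callers would then write `MyContext.create(...)` instead of `new MyContext(...)`, which also makes it straightforward to inject alternative dependencies in tests.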
### Alternative solutions:
As an alternative solution, we could refactor SparkContext's primary
constructor to perform all object creation in a giant `try-finally` block.
Unfortunately, this would require us to turn a bunch of `vals` into `vars` so
that they can be assigned from the `try` block. If we still want `vals`, we
could wrap each `val` in its own `try` block (since a try block can return a
value), but this would lead to extremely messy code and wouldn't guard against
the introduction of future code that doesn't properly handle failures.
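For illustration, a minimal sketch of that `val`-per-`try` variant (with hypothetical `DepA` / `DepB` classes rather than Spark's real dependencies) shows how quickly the boilerplate piles up:
```scala
// Hypothetical dependencies, used only to illustrate the pattern.
class DepA { def stop(): Unit = () }
class DepB(a: DepA) { def stop(): Unit = () }

class MessyContext {
  // Each val keeps its immutability by wrapping its initializer in a try,
  // but every field must repeat the cleanup-on-failure logic for everything
  // constructed before it.
  val depA: DepA = new DepA()

  val depB: DepB =
    try {
      new DepB(depA)
    } catch {
      case e: Exception =>
        depA.stop() // undo the earlier construction before re-throwing
        throw e
    }
}
```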
The more complex approach outlined above gives us some nice dependency
injection benefits, so I think that might be preferable to a `var`-ification.
### The limitations of this PR's solution:
This PR ensures that _an error_ will occur if multiple SparkContexts are
active at the same time, even if multiple threads race to construct those
contexts. The key problem is that we don't hold the
`SparkContext.SPARK_CONTEXT_CONSTRUCTOR_LOCK` for the entire duration of the
constructor, so we don't have a way to distinguish between a SparkContext
creation attempt that failed because another SparkContext was undergoing
construction and a case where a previous construction attempt failed and left
some state that wasn't cleaned up. This means that it's possible for
SparkContext's constructor to throw an exception in a way that prevents any
future creations of SparkContext until the JVM is restarted.
This is unlikely to be a problem in practice, though, especially since I've
added special-casing so that this issue doesn't affect the most common causes
of SparkContext construction errors (e.g. forgetting to set the master URL or
application name). In case it does become a problem, I've added an undocumented
`spark.driver.disableMultipleSparkContextsErrorChecking` option that turns
this error into a warning.
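A minimal sketch of the check-then-release shape described above (the names `constructorLock` and `contextUnderConstruction` are simplifications for illustration, not the exact fields in this patch):
```scala
object ContextGuard {
  private val constructorLock = new Object()
  private var contextUnderConstruction = false

  // Called at the top of the constructor. The lock is held only for this
  // check, not for the whole constructor body, so a constructor that throws
  // before reaching markStopped() leaves the flag set.
  def markConstructionStarted(): Unit = constructorLock.synchronized {
    if (contextUnderConstruction) {
      // We cannot tell "another thread is mid-construction" apart from
      // "a previous attempt failed and never cleaned up its state".
      throw new IllegalStateException(
        "Another SparkContext is (or appears to be) under construction")
    }
    contextUnderConstruction = true
  }

  // Called from stop(); if construction fails before stop() can run, future
  // creation attempts keep failing until the JVM is restarted.
  def markStopped(): Unit = constructorLock.synchronized {
    contextUnderConstruction = false
  }
}
```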
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark SPARK-4180
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3121.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3121
----
commit afaa7e37cce76da8173ff556d217a67a1633426a
Author: Josh Rosen <[email protected]>
Date: 2014-11-05T23:06:13Z
[SPARK-4180] Prevent creations of multiple active SparkContexts.
----