git commit: SPARK-1097: Do not introduce deadlock while fixing concurrency bug

pwendell Wed, 16 Jul 2014 14:11:34 -0700

Repository: spark
Updated Branches:
  refs/heads/branch-1.0 bf1ddc7b8 -> 91e7a71c6



SPARK-1097: Do not introduce deadlock while fixing concurrency bug

We recently added this lock on 'conf' in order to prevent concurrent creation. 
However, it turns out that this can introduce a deadlock because Hadoop also 
synchronizes on the Configuration objects when creating new Configurations (and 
they do so via a static REGISTRY which contains all created Configurations).

This fix forces all Spark initialization of Configuration objects to occur 
serially by using a static lock that we control, which prevents concurrent 
creation without introducing the deadlock.

Author: Aaron Davidson <[email protected]>

Closes #1409 from aarondav/1054 and squashes the following commits:

7d1b769 [Aaron Davidson] SPARK-1097: Do not introduce deadlock while fixing concurrency bug
(cherry picked from commit 8867cd0bc2961fefed84901b8b14e9676ae6ab18)

Signed-off-by: Patrick Wendell <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/91e7a71c
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/91e7a71c
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/91e7a71c

Branch: refs/heads/branch-1.0
Commit: 91e7a71c68eb9ff0738c21bc7525fa89bd662993
Parents: bf1ddc7
Author: Aaron Davidson <[email protected]>
Authored: Wed Jul 16 14:10:17 2014 -0700
Committer: Patrick Wendell <[email protected]>
Committed: Wed Jul 16 14:10:33 2014 -0700

----------------------------------------------------------------------
 core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/91e7a71c/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
index a55b226..d0a2241 100644
--- a/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala
@@ -139,8 +139,8 @@ class HadoopRDD[K, V](
       // Create a JobConf that will be cached and used across this RDD's getJobConf() calls in the
       // local process. The local cache is accessed through HadoopRDD.putCachedMetadata().
       // The caching helps minimize GC, since a JobConf can contain ~10KB of temporary objects.
-      // synchronize to prevent ConcurrentModificationException (Spark-1097, Hadoop-10456)
-      conf.synchronized {
+      // Synchronize to prevent ConcurrentModificationException (Spark-1097, Hadoop-10456).
+      HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
+      HadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
         val newJobConf = new JobConf(conf)
         initLocalJobConfFuncOpt.map(f => f(newJobConf))
         HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
@@ -231,6 +231,9 @@ class HadoopRDD[K, V](
 }
 
 private[spark] object HadoopRDD {
+  /** Constructing Configuration objects is not threadsafe, use this lock to serialize. */
+  val CONFIGURATION_INSTANTIATION_LOCK = new Object()
+
   /**
    * The three methods below are helpers for accessing the local map, a property of the SparkEnv of
    * the local process.
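
The locking pattern in the diff above can be sketched in isolation. This is a
minimal illustration, not the Spark or Hadoop code: FakeConf is a hypothetical
stand-in for Hadoop's Configuration/JobConf, and ConfLockSketch,
InstantiationLock and newConf are names invented here. The point it tries to
show is that every construction goes through one lock object owned by our own
code, rather than through the monitor of the object being copied, which
another library might also acquire internally.

```scala
import java.util.concurrent.{Callable, Executors}

// Hypothetical stand-in for a Hadoop Configuration/JobConf (not the real class).
// Like the real copy constructor, it reads the parent's state while being built.
final class FakeConf(parent: Option[FakeConf]) {
  private val props = scala.collection.mutable.Map[String, String]()
  parent.foreach(p => props ++= p.snapshot)

  def set(key: String, value: String): Unit = props(key) = value
  def snapshot: Map[String, String] = props.toMap
}

object ConfLockSketch {
  // A single process-wide lock that only this code knows about. Because no
  // third-party code can synchronize on it, it cannot take part in a
  // lock-ordering cycle with locks held inside that third-party code.
  val InstantiationLock = new Object()

  // All copies are created serially under the dedicated lock,
  // instead of under `base.synchronized`.
  def newConf(base: FakeConf): FakeConf = InstantiationLock.synchronized {
    new FakeConf(Some(base))
  }

  def main(args: Array[String]): Unit = {
    val base = new FakeConf(None)
    base.set("k", "v")

    val pool = Executors.newFixedThreadPool(4)
    // Explicit Callable avoids the Runnable/Callable overload ambiguity of submit().
    val futures = (1 to 8).map { _ =>
      pool.submit(new Callable[FakeConf] { def call(): FakeConf = newConf(base) })
    }
    val copies = futures.map(_.get())
    pool.shutdown()

    assert(copies.forall(_.snapshot == Map("k" -> "v")))
  }
}
```

The trade-off is that unrelated configurations can no longer be constructed in
parallel, which is presumably acceptable here because the result is cached and
construction is infrequent.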
