[ https://issues.apache.org/jira/browse/SPARK-5191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294400#comment-14294400 ]
Josh Rosen commented on SPARK-5191:
-----------------------------------

I think this is actually caused by an interaction between Python threads and the Python import mechanism. According to the Python documentation's section on [importing in threaded code|https://docs.python.org/2/library/threading.html#importing-in-threaded-code]:

{quote}
other than in the main module, an import should not have the side effect of spawning a new thread and then waiting for that thread in any way. Failing to abide by this restriction can lead to a deadlock if the spawned thread directly or indirectly attempts to import a module.
{quote}

I think what's happening here is that when you run {{b.py}}, the import lock is held the entire time by the main thread, since all of {{a.py}} executes while the import statement in {{b.py}} is being handled. When the accumulator-handling thread then tries to perform an import, it blocks on the lock held by the main thread, producing the deadlock that we've observed.

Technically, I guess the code presented here violates the thread-safety rules from the Python documentation, so one fix would be to rewrite your code so that the import has no side effects. Another, perhaps more brittle, alternative would be to modify PySpark so that we don't perform imports from threads.
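The "import without side effects" fix can be sketched in plain Python, with no PySpark dependency. Here {{run_with_worker_import}} is a hypothetical stand-in for the side effect that {{a.py}} triggers (PySpark spawning and joining its accumulator thread, which itself performs imports); the point is that moving such side effects behind an {{if __name__ == "__main__":}} guard means importing the module no longer runs them while the import lock is held:

```python
import threading


def run_with_worker_import():
    """Spawn a thread that performs an import, then join it.

    This is safe when called from normal execution. Doing it at module
    import time is what the threading docs warn against: the worker's
    import can block on the lock the importing (main) thread holds.
    """
    result = []

    def worker():
        import json  # an import performed inside a worker thread
        result.append(json.dumps({"ok": True}))

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result[0]


if __name__ == "__main__":
    # Side effects live behind the guard, so `from this_module import *`
    # elsewhere does not spawn-and-join a thread during import.
    print(run_with_worker_import())
```

With this layout, another module can safely do {{from a import *}} to reuse the definitions, and only running the file directly triggers the threaded work.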
> Pyspark: scheduler hangs when importing a standalone pyspark app
> ----------------------------------------------------------------
>
>                 Key: SPARK-5191
>                 URL: https://issues.apache.org/jira/browse/SPARK-5191
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.0.2, 1.1.1, 1.3.0, 1.2.1
>            Reporter: Daniel Liu
>
> In a.py:
> {code}
> from pyspark import SparkContext
> sc = SparkContext("local", "test spark")
> rdd = sc.parallelize(range(1, 10))
> print rdd.count()
> {code}
> In b.py:
> {code}
> from a import *
> {code}
> {{python a.py}} runs fine.
> {{python b.py}} will hang at "TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool".
> {{./bin/spark-submit --py-files a.py b.py}} has the same problem.