[ https://issues.apache.org/jira/browse/SPARK-5191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294400#comment-14294400 ]

Josh Rosen commented on SPARK-5191:
-----------------------------------

I think this is actually caused by an interaction between Python threads and the Python import mechanism. According to the Python documentation's section on [importing in threaded code|https://docs.python.org/2/library/threading.html#importing-in-threaded-code]:

{quote}
other than in the main module, an import should not have the side effect of 
spawning a new thread and then waiting for that thread in any way. Failing to 
abide by this restriction can lead to a deadlock if the spawned thread directly 
or indirectly attempts to import a module.
{quote}

I think what's happening here is that when you run {{b.py}}, the import lock is held for the entire run by the main thread, since all of {{a.py}} executes while the import statement in {{b.py}} is being handled. When the accumulator-handling thread then tries to perform an import, it blocks on the lock held by the main thread, leading to the deadlock that we've observed.
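
As a minimal illustration of the same deadlock, independent of Spark (the file names {{mod.py}} and {{run.py}} here are hypothetical), consider a module that spawns a thread at import time and then joins it while the thread itself tries to import. Under Python 2's single global import lock (Python 3.3+ uses per-module locks and doesn't hang this way), importing such a module hangs:

{code}
# mod.py -- spawns a thread at import time and waits for it (hypothetical demo)
import threading

def worker():
    # While mod is being imported, the main thread holds Python 2's global
    # import lock, so this import blocks until that lock is released.
    import json

t = threading.Thread(target=worker)
t.start()
t.join()  # main thread waits for worker while holding the import lock -> deadlock
{code}

{code}
# run.py -- "python run.py" hangs; "python mod.py" runs fine, because no
# import lock is held while the module body runs as __main__.
import mod
{code}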

Technically, I guess that the code presented here violates the thread-safety rules from the Python documentation, so one fix would be to rewrite your code so that the import doesn't have side effects, as sketched below. Another alternative, perhaps more brittle, would be to modify PySpark so that we don't perform imports from threads.
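
For the first fix, here's a sketch of {{a.py}} restructured so that importing it has no side effects (the {{main()}} function name is just a convention I've picked for illustration):

{code}
# a.py -- no Spark job runs at import time anymore
from pyspark import SparkContext

def main():
    sc = SparkContext("local", "test spark")
    rdd = sc.parallelize(range(1, 10))
    print rdd.count()

if __name__ == "__main__":
    main()
{code}

With this structure, {{python a.py}} still runs the job, and {{b.py}} would call {{a.main()}} explicitly instead of relying on {{from a import *}} to kick it off.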

> Pyspark: scheduler hangs when importing a standalone pyspark app
> ----------------------------------------------------------------
>
>                 Key: SPARK-5191
>                 URL: https://issues.apache.org/jira/browse/SPARK-5191
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.0.2, 1.1.1, 1.3.0, 1.2.1
>            Reporter: Daniel Liu
>
> In a.py:
> {code}
> from pyspark import SparkContext
> sc = SparkContext("local", "test spark")
> rdd = sc.parallelize(range(1, 10))
> print rdd.count()
> {code}
> In b.py:
> {code}
> from a import *
> {code}
> {{python a.py}} runs fine.
> {{python b.py}} hangs after logging "TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool".
> {{./bin/spark-submit --py-files a.py b.py}} has the same problem.


