[ https://issues.apache.org/jira/browse/SPARK-3910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268332#comment-14268332 ]

Josh Rosen commented on SPARK-3910:
-----------------------------------

It looks like this was fixed by SPARK-4348, but we're now hitting this error 
when running PySpark tests in Jenkins jobs for the maintenance branches 
(Jenkins wasn't running these tests for those branches before, so it's not 
clear when the problem was introduced).  I'll see if I can figure out a fix 
for the backport branches.

> ./python/pyspark/mllib/classification.py doctests fails with module name 
> pollution
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-3910
>                 URL: https://issues.apache.org/jira/browse/SPARK-3910
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 1.2.0
>         Environment: Mac OS X 10.9.5, Python 2.6.8, Java 1.8.0_20, 
> Jinja2==2.7.3, MarkupSafe==0.23, Pygments==1.6, Sphinx==1.2.3, 
> argparse==1.2.1, docutils==0.12, flake8==2.2.3, mccabe==0.2.1, numpy==1.9.0, 
> pep8==1.5.7, psutil==2.1.3, pyflake8==0.1.9, pyflakes==0.8.1, 
> unittest2==0.5.1, wsgiref==0.1.2
>            Reporter: Tomohiko K.
>              Labels: pyspark, testing
>
> In ./python/run-tests script, we run the doctests in 
> ./pyspark/mllib/classification.py.
> The output is as following:
> {noformat}
> $ ./python/run-tests
> ...
> Running test: pyspark/mllib/classification.py
> Traceback (most recent call last):
>   File "pyspark/mllib/classification.py", line 20, in <module>
>     import numpy
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/__init__.py", line 170, in <module>
>     from . import add_newdocs
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/add_newdocs.py", line 13, in <module>
>     from numpy.lib import add_newdoc
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/__init__.py", line 8, in <module>
>     from .type_check import *
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/lib/type_check.py", line 11, in <module>
>     import numpy.core.numeric as _nx
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/core/__init__.py", line 46, in <module>
>     from numpy.testing import Tester
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/__init__.py", line 13, in <module>
>     from .utils import *
>   File "/Users/tomohiko/.virtualenvs/pyspark_py26/lib/python2.6/site-packages/numpy/testing/utils.py", line 15, in <module>
>     from tempfile import mkdtemp
>   File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/tempfile.py", line 34, in <module>
>     from random import Random as _Random
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/mllib/random.py", line 24, in <module>
>     from pyspark.rdd import RDD
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/__init__.py", line 51, in <module>
>     from pyspark.context import SparkContext
>   File "/Users/tomohiko/MyRepos/Scala/spark/python/pyspark/context.py", line 22, in <module>
>     from tempfile import NamedTemporaryFile
> ImportError: cannot import name NamedTemporaryFile
>         0.07 real         0.04 user         0.02 sys
> Had test failures; see logs.
> {noformat}
> The problem is a cyclic import of the tempfile module.
> The cause is that the pyspark.mllib.random module lives in the same 
> directory as the pyspark.mllib.classification module.
> The classification module imports numpy, and numpy in turn imports the 
> tempfile module.
> Because the first entry of sys.path is the directory "./python/pyspark/mllib" 
> (where the executed file "classification.py" lives), tempfile's own 
> "import random" resolves to pyspark.mllib.random instead of the standard 
> library "random" module.
> The import chain eventually reaches tempfile again, forming a cycle.
> Summary: classification → numpy → tempfile → pyspark.mllib.random → tempfile 
> → (cyclic import!!)
> Furthermore, stat is also a standard-library module and a pyspark.mllib.stat 
> module exists, which may cause the same kind of trouble.
> commit: 0e8203f4fb721158fb27897680da476174d24c4b
> A fundamental solution is to avoid module names that clash with the standard 
> library (currently "random" and "stat").
> The difficulty with this solution is that renaming pyspark.mllib.random and 
> pyspark.mllib.stat may break code that already uses them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
