Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/1628#issuecomment-50687582
I think the problem when running `/bin/pyspark
python/pyspark/mllib/linalg.py` is that `$SPARK_HOME/python/pyspark/mllib/` is
ending up at the front of `sys.path`, so its `random` is being imported first.
[According to the Python
docs](https://docs.python.org/2/library/sys.html#sys.path):
> As initialized upon program startup, the first item of this list,
path[0], is the directory containing the script that was used to invoke the
Python interpreter. If the script directory is not available (e.g. if the
interpreter is invoked interactively or if the script is read from standard
input), path[0] is the empty string, which directs Python to search modules in
the current directory first. Notice that the script directory is inserted
before the entries inserted as a result of PYTHONPATH.
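
A minimal sketch of the quoted behavior, assuming a Python 3 interpreter and a throwaway script (the name `show_path.py` is made up for illustration): it runs the script in a subprocess and compares the `sys.path[0]` it reports with the script's own directory.

```python
import os
import subprocess
import sys
import tempfile


def first_path_entry_of_script():
    """Run a throwaway script; return (script directory, sys.path[0] seen by the script)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "show_path.py")  # hypothetical script name
        with open(script, "w") as f:
            f.write("import sys\nprint(sys.path[0])\n")
        # PYTHONSAFEPATH (Python 3.11+) suppresses this behavior, so clear it here.
        env = dict(os.environ)
        env.pop("PYTHONSAFEPATH", None)
        out = subprocess.check_output(
            [sys.executable, script], env=env, universal_newlines=True
        )
        # Resolve symlinks on both sides so the comparison is fair.
        return os.path.realpath(tmp), os.path.realpath(out.strip())


script_dir, path0 = first_path_entry_of_script()
print(script_dir == path0)
```
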
I don't think we want this behavior in the `linalg.py` test. I seemed to
be able to fix things by just popping the first entry off of `sys.path` when
running that test:
```diff
diff --git a/python/pyspark/mllib/linalg.py b/python/pyspark/mllib/linalg.py
index 71f4ad1..ced6e34 100644
--- a/python/pyspark/mllib/linalg.py
+++ b/python/pyspark/mllib/linalg.py
@@ -255,4 +255,6 @@ def _test():
         exit(-1)
 
 if __name__ == "__main__":
+    import sys
+    sys.path = sys.path[1:]
     _test()
```