[ https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358504#comment-14358504 ]

Pavel Laskov commented on SPARK-6282:
-------------------------------------

Hi Sven and Joseph,

Thanks for the quick reply to my bug report. I still think the problem is 
somewhere in Spark. Below is a self-contained code snippet that triggers the 
error on my system. Uncommenting any of the imports marked with ### causes a 
crash; switching to "import random / random.random()" fixes the problem. Note 
that none of the functions imported in the ### lines is used in the test code. 
It looks like a very obscure dependency of some mllib components on _winreg?

<begin code>
from random import random
# import random
from pyspark.context import SparkContext
from pyspark.mllib.rand import RandomRDDs
### Any of these imports causes the crash:
### from pyspark.mllib.tree import RandomForest, DecisionTreeModel
### from pyspark.mllib.linalg import SparseVector
### from pyspark.mllib.regression import LabeledPoint

if __name__ == "__main__":
    sc = SparkContext(appName="Random() bug test")
    data = RandomRDDs.normalVectorRDD(sc, numRows=10000, numCols=200)
    d = data.map(lambda x: (random(), x))
    print d.first()
<end code>
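
For completeness, here is the variant that runs cleanly on my system. The only 
change is importing the random module instead of the bare function; as far as 
I can tell from the trace below (my reading, not verified inside cloudpickle), 
the module object is then pickled by reference, so cloudpickle never has to 
run pickle.whichmodule() on the built-in random() function:

<begin code>
import random
from pyspark.context import SparkContext
from pyspark.mllib.rand import RandomRDDs

if __name__ == "__main__":
    sc = SparkContext(appName="Random() bug test, workaround")
    data = RandomRDDs.normalVectorRDD(sc, numRows=10000, numCols=200)
    # random.random is resolved through the module object, which pickles
    # by name; the bare built-in function is what triggered the
    # whichmodule() scan in the trace below.
    d = data.map(lambda x: (random.random(), x))
    print d.first()
<end code>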

Here is the full trace of the error:

Traceback (most recent call last):
  File "/home/laskov/research/pe-class/python/src/experiments/test_random.py", line 16, in <module>
    print d.first()
  File "/home/laskov/code/spark-1.2.1/python/pyspark/rdd.py", line 1139, in first
    rs = self.take(1)
  File "/home/laskov/code/spark-1.2.1/python/pyspark/rdd.py", line 1091, in take
    totalParts = self._jrdd.partitions().size()
  File "/home/laskov/code/spark-1.2.1/python/pyspark/rdd.py", line 2115, in _jrdd
    pickled_command = ser.dumps(command)
  File "/home/laskov/code/spark-1.2.1/python/pyspark/serializers.py", line 406, in dumps
    return cloudpickle.dumps(obj, 2)
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 816, in dumps
    cp.dump(obj)
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 133, in dump
    return pickle.Pickler.dump(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 562, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 254, in save_function
    self.save_function_tuple(obj, [themodule])
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 304, in save_function_tuple
    save((code, closure, base_globals))
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 600, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib/python2.7/pickle.py", line 633, in _batch_appends
    save(x)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 254, in save_function
    self.save_function_tuple(obj, [themodule])
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 304, in save_function_tuple
    save((code, closure, base_globals))
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple
    save(element)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/lib/python2.7/pickle.py", line 600, in save_list
    self._batch_appends(iter(obj))
  File "/usr/lib/python2.7/pickle.py", line 636, in _batch_appends
    save(tmp[0])
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 249, in save_function
    self.save_function_tuple(obj, modList)
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 309, in save_function_tuple
    save(f_globals)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 174, in save_dict
    pickle.Pickler.save_dict(self, obj)
  File "/usr/lib/python2.7/pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "/usr/lib/python2.7/pickle.py", line 686, in _batch_setitems
    save(v)
  File "/usr/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 399, in save_builtin_function
    return self.save_function(obj)
  File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 209, in save_function
    modname = pickle.whichmodule(obj, name)
  File "/usr/lib/python2.7/pickle.py", line 817, in whichmodule
    if name != '__main__' and getattr(module, funcname, None) is func:
  File "/usr/lib/python2.7/dist-packages/scipy/lib/six.py", line 116, in __getattr__
    _module = self._resolve()
  File "/usr/lib/python2.7/dist-packages/scipy/lib/six.py", line 105, in _resolve
    return _import_module(self.mod)
  File "/usr/lib/python2.7/dist-packages/scipy/lib/six.py", line 76, in _import_module
    __import__(name)
ImportError: No module named _winreg
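
My best guess at the mechanism, judging from the tail of the trace (an 
assumption on my part, not verified inside scipy): scipy.lib.six keeps lazy 
"moved module" proxies in sys.modules, and pickle.whichmodule() probes every 
sys.modules entry with getattr(). The default argument of getattr() only 
swallows AttributeError, so when the proxy for winreg lazily runs 
__import__('_winreg') on Linux, the ImportError escapes. The toy below 
reproduces the same error on Linux/Python 2.7 without Spark or scipy; 
LazyModuleProxy is a hypothetical stand-in for scipy.lib.six.MovedModule:

<begin code>
import sys
import pickle

class LazyModuleProxy(object):
    """Hypothetical stand-in for scipy.lib.six.MovedModule: it sits in
    sys.modules and imports its target only on first attribute access."""
    def __init__(self, mod):
        self.mod = mod
    def __getattr__(self, attr):
        module = __import__(self.mod)  # fails on Linux for '_winreg'
        return getattr(module, attr)

# scipy.lib.six registers entries along these lines for its 2/3 "moves"
sys.modules['fake_six.moves.winreg'] = LazyModuleProxy('_winreg')

def probe():
    pass
probe.__module__ = None  # force whichmodule() to scan sys.modules

# whichmodule() getattr()s every module in sys.modules; the default
# argument only swallows AttributeError, so the proxy's ImportError
# ("No module named _winreg") escapes, just as in the trace above.
pickle.whichmodule(probe, 'probe')
<end code>

If that is right, the ### imports matter only because they pull in scipy 
(pyspark.mllib.linalg, for one, tries "import scipy.sparse" when scipy is 
available), which installs those proxies.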


Best regards from Munich,

---
Pavel Laskov
Principal Engineer, Security Product Innovation Team
T: + 49 (0)89 158834-4170
E: pavel.las...@huawei.com
European Research Center, HUAWEI TECHNOLOGIES Duesseldorf GmbH 
Riessstr. 25 C-3.0G, 80992 Munich




> Strange Python import error when using random() in a lambda function
> --------------------------------------------------------------------
>
>                 Key: SPARK-6282
>                 URL: https://issues.apache.org/jira/browse/SPARK-6282
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>         Environment: Kubuntu 14.04, Python 2.7.6
>            Reporter: Pavel Laskov
>            Priority: Minor
>
> Consider the example Python code below:
>
> from random import random
> from pyspark.context import SparkContext
> from xval_mllib import read_csv_file_as_list
>
> if __name__ == "__main__":
>     sc = SparkContext(appName="Random() bug test")
>     data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
>     # data = sc.parallelize([1, 2, 3, 4, 5], 2)
>     d = data.map(lambda x: (random(), x))
>     print d.first()
> Data is read from a large CSV file. Running this code results in a Python 
> import error:
> ImportError: No module named _winreg
> If I use 'import random' and 'random.random()' in the lambda function, no 
> error occurs. No error occurs either, with both kinds of import statements, 
> for a small artificial data set like the one in the commented-out line.
> The full error trace, the source code of the CSV-reading code (the function 
> 'read_csv_file_as_list' is my own), and a sample dataset (the original 
> dataset is about 8 MB) can be provided.


