[ https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14358504#comment-14358504 ]
Pavel Laskov commented on SPARK-6282:
-------------------------------------

Hi Sven and Joseph,

Thanks for the quick reply to my bug report. I still think the problem lies somewhere in Spark. Here is a self-contained code snippet that triggers the error on my system. Uncommenting any of the imports marked with ### causes a crash; switching to "import random / random.random()" fixes the problem. Note that none of the functions imported in the ### lines is used in the test code. Looks like a very obscure dependency of some mllib components on _winreg?

<begin code>
from random import random
# import random
from pyspark.context import SparkContext
from pyspark.mllib.rand import RandomRDDs
### Any of these imports causes the crash
### from pyspark.mllib.tree import RandomForest, DecisionTreeModel
### from pyspark.mllib.linalg import SparseVector
### from pyspark.mllib.regression import LabeledPoint

if __name__ == "__main__":
    sc = SparkContext(appName="Random() bug test")
    data = RandomRDDs.normalVectorRDD(sc, numRows=10000, numCols=200)
    d = data.map(lambda x: (random(), x))
    print d.first()
<end code>
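For reference, here is the same snippet with the workaround applied; the only change is the import style and the qualified call inside the lambda, and this variant runs without error on my system:

<begin code>
import random
from pyspark.context import SparkContext
from pyspark.mllib.rand import RandomRDDs

if __name__ == "__main__":
    sc = SparkContext(appName="Random() bug test")
    data = RandomRDDs.normalVectorRDD(sc, numRows=10000, numCols=200)
    # Qualified call random.random() instead of the bare name imported
    # via "from random import random"
    d = data.map(lambda x: (random.random(), x))
    print d.first()
<end code>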
File "/usr/lib/python2.7/pickle.py", line 548, in save_tuple save(element) File "/usr/lib/python2.7/pickle.py", line 286, in save f(self, obj) # Call unbound method with explicit self File "/usr/lib/python2.7/pickle.py", line 600, in save_list self._batch_appends(iter(obj)) File "/usr/lib/python2.7/pickle.py", line 636, in _batch_appends save(tmp[0]) File "/usr/lib/python2.7/pickle.py", line 286, in save f(self, obj) # Call unbound method with explicit self File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 249, in save_function self.save_function_tuple(obj, modList) File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 309, in save_function_tuple save(f_globals) File "/usr/lib/python2.7/pickle.py", line 286, in save f(self, obj) # Call unbound method with explicit self File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 174, in save_dict pickle.Pickler.save_dict(self, obj) File "/usr/lib/python2.7/pickle.py", line 649, in save_dict self._batch_setitems(obj.iteritems()) File "/usr/lib/python2.7/pickle.py", line 686, in _batch_setitems save(v) File "/usr/lib/python2.7/pickle.py", line 286, in save f(self, obj) # Call unbound method with explicit self File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 399, in save_builtin_function return self.save_function(obj) File "/home/laskov/code/spark-1.2.1/python/pyspark/cloudpickle.py", line 209, in save_function modname = pickle.whichmodule(obj, name) File "/usr/lib/python2.7/pickle.py", line 817, in whichmodule if name != '__main__' and getattr(module, funcname, None) is func: File "/usr/lib/python2.7/dist-packages/scipy/lib/six.py", line 116, in __getattr__ _module = self._resolve() File "/usr/lib/python2.7/dist-packages/scipy/lib/six.py", line 105, in _resolve return _import_module(self.mod) File "/usr/lib/python2.7/dist-packages/scipy/lib/six.py", line 76, in _import_module __import__(name) ImportError: No module named _winreg Best regards from Munich, --- Pavel Laskov Principal Engineer, Security Product Innovation Team T: + 49 (0)89 158834-4170 E: pavel.las...@huawei.com European Research Center, HUAWEI TECHNOLOGIES Duesseldorf GmbH Riessstr. 25 C-3.0G, 80992 Munich > Strange Python import error when using random() in a lambda function > -------------------------------------------------------------------- > > Key: SPARK-6282 > URL: https://issues.apache.org/jira/browse/SPARK-6282 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 1.2.0 > Environment: Kubuntu 14.04, Python 2.7.6 > Reporter: Pavel Laskov > Priority: Minor > > Consider the exemplary Python code below: > from random import random > from pyspark.context import SparkContext > from xval_mllib import read_csv_file_as_list > if __name__ == "__main__": > sc = SparkContext(appName="Random() bug test") > data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv')) > #data = sc.parallelize([1, 2, 3, 4, 5], 2) > d = data.map(lambda x: (random(), x)) > print d.first() > Data is read from a large CSV file. Running this code results in a Python > import error: > ImportError: No module named _winreg > If I use 'import random' and 'random.random()' in the lambda function no > error occurs. Also no error occurs, for both kinds of import statements, for > a small artificial data set like the one shown in a commented line. 
Best regards from Munich,

---
Pavel Laskov
Principal Engineer, Security Product Innovation Team
T: + 49 (0)89 158834-4170
E: pavel.las...@huawei.com
European Research Center, HUAWEI TECHNOLOGIES Duesseldorf GmbH
Riessstr. 25 C-3.0G, 80992 Munich

> Strange Python import error when using random() in a lambda function
> --------------------------------------------------------------------
>
>                 Key: SPARK-6282
>                 URL: https://issues.apache.org/jira/browse/SPARK-6282
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.0
>        Environment: Kubuntu 14.04, Python 2.7.6
>            Reporter: Pavel Laskov
>            Priority: Minor
>
> Consider the exemplary Python code below:
>
> from random import random
> from pyspark.context import SparkContext
> from xval_mllib import read_csv_file_as_list
>
> if __name__ == "__main__":
>     sc = SparkContext(appName="Random() bug test")
>     data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
>     #data = sc.parallelize([1, 2, 3, 4, 5], 2)
>     d = data.map(lambda x: (random(), x))
>     print d.first()
>
> Data is read from a large CSV file. Running this code results in a Python import error:
>
> ImportError: No module named _winreg
>
> If I use 'import random' and 'random.random()' in the lambda function, no error occurs. Also, no error occurs, with either kind of import statement, for a small artificial data set like the one shown in the commented-out line.
> The full error trace, the source code of the CSV reading function ('read_csv_file_as_list' is my own), and a sample dataset (the original dataset is about 8 MB) can be provided.