[
https://issues.apache.org/jira/browse/SPARK-6282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363340#comment-14363340
]
Pavel Laskov commented on SPARK-6282:
-------------------------------------
Hi Davies,
Yes, I was also quite baffled that everything works on a small artificial
dataset. Here is an example that fails on my machine while still being
independent of real data I am using as well as any data-specific code on my
part:
from numpy.random import random
# import random
from pyspark.context import SparkContext
from pyspark.mllib.rand import RandomRDDs
### Any of these imports causes the crash
from pyspark.mllib.tree import RandomForest, DecisionTreeModel
### from pyspark.mllib.linalg import SparseVector
### from pyspark.mllib.regression import LabeledPoint

if __name__ == "__main__":
    sc = SparkContext(appName="Random() bug test")
    data = RandomRDDs.normalVectorRDD(sc, numRows=10000, numCols=200)
    d = data.map(lambda x: (random(), x))
    print d.first()
What breaks this code is the import of certain mllib packages, *even if they
are not used* anywhere in the code (you can try any of the imports from the
### section). Another baffling thing is that nothing happens until some
collection operation, such as 'collect', 'top' or 'first', is executed:
comment out the print statement and the error disappears.
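For completeness, here is a minimal sketch of the workaround mentioned in the
issue description: keep 'import random' at module level and call the
module-qualified 'random.random()' inside the lambda, rather than importing
the bare 'random' function. In the real job this lambda would be passed to
'data.map(...)'; the plain list below only stands in for the RDD so the
pattern can be shown in isolation.

```python
import random

# Workaround pattern from the report: reference the function through its
# module ('random.random') inside the lambda instead of importing the bare
# name with 'from random import random'.
pair_with_key = lambda x: (random.random(), x)

# Stand-in for data.map(pair_with_key) on an RDD.
sample = [1, 2, 3]
keyed = [pair_with_key(x) for x in sample]
```

With this form of the import, the reported ImportError does not occur.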
Best regards from Munich,
---
Pavel Laskov
Principal Engineer, Security Product Innovation Team
T: + 49 (0)89 158834-4170
E: [email protected]
European Research Center, HUAWEI TECHNOLOGIES Duesseldorf GmbH
Riessstr. 25 C-3.0G, 80992 Munich
> Strange Python import error when using random() in a lambda function
> --------------------------------------------------------------------
>
> Key: SPARK-6282
> URL: https://issues.apache.org/jira/browse/SPARK-6282
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.2.0
> Environment: Kubuntu 14.04, Python 2.7.6
> Reporter: Pavel Laskov
> Priority: Minor
>
> Consider the exemplary Python code below:
> from random import random
> from pyspark.context import SparkContext
> from xval_mllib import read_csv_file_as_list
>
> if __name__ == "__main__":
>     sc = SparkContext(appName="Random() bug test")
>     data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
>     # data = sc.parallelize([1, 2, 3, 4, 5], 2)
>     d = data.map(lambda x: (random(), x))
>     print d.first()
> Data is read from a large CSV file. Running this code results in a Python
> import error:
> ImportError: No module named _winreg
> If I use 'import random' and call 'random.random()' in the lambda function,
> no error occurs. No error occurs either, with either kind of import
> statement, for a small artificial data set like the one shown in the
> commented-out line.
> The full error trace, the source code of the CSV-reading function
> ('read_csv_file_as_list' is my own) and a sample dataset (the original
> dataset is about 8 MB) can be provided.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]