Pavel Laskov created SPARK-6282:
-----------------------------------
Summary: Strange Python import error when using random() in a
lambda function
Key: SPARK-6282
URL: https://issues.apache.org/jira/browse/SPARK-6282
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.2.0
Environment: Kubuntu 14.04, Python 2.7.6
Reporter: Pavel Laskov
Priority: Minor
Consider the following example Python code:
from random import random
from pyspark.context import SparkContext
from xval_mllib import read_csv_file_as_list
if __name__ == "__main__":
    sc = SparkContext(appName="Random() bug test")
    data = sc.parallelize(read_csv_file_as_list('data/malfease-xp.csv'))
    #data = sc.parallelize([1, 2, 3, 4, 5], 2)
    d = data.map(lambda x: (random(), x))
    print d.first()
Data is read from a large CSV file. Running this code results in a Python
import error:
ImportError: No module named _winreg
If I use 'import random' and call 'random.random()' in the lambda function, no
error occurs. No error occurs with either import style, either, when the input
is a small artificial data set like the one in the commented-out line.
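
For reference, the variant that does not fail can be sketched as below. This is
only an illustration of the workaround described above, not a diagnosis of the
underlying bug. So that it runs without a Spark installation, the built-in
map() and a small list stand in for 'data.map(...)' and the CSV data. The
helper name 'tag_with_random' is hypothetical and was not part of the original
script.

```python
import random  # import the module itself rather than "from random import random"


def tag_with_random(x):
    # Resolve random() through the module at call time, which is the
    # import style reported above as not triggering the error.
    return (random.random(), x)


# Stand-in for the RDD: a small in-memory list, as in the commented-out line.
sample = [1, 2, 3, 4, 5]

# With PySpark this would presumably be data.map(tag_with_random);
# the built-in map() is used here only to keep the sketch self-contained.
pairs = list(map(tag_with_random, sample))
```

Each element of 'pairs' is a (random float, original value) tuple, the same
shape as the output of the lambda in the failing script.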
I can provide the full error trace, the source code of the CSV reader (the
function 'read_csv_file_as_list' is my own), and a sample dataset (the original
dataset is about 8M in size).