Pavel Laskov created SPARK-6362:
-----------------------------------
Summary: Broken pipe error when training a RandomForest on a union
of two RDDs
Key: SPARK-6362
URL: https://issues.apache.org/jira/browse/SPARK-6362
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 1.2.0
Environment: Kubuntu 14.04, local driver
Reporter: Pavel Laskov
Priority: Minor
Training a RandomForest classifier on a dataset obtained as a union of two RDDs
throws a broken pipe error:
Traceback (most recent call last):
File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 162, in
manager
code = worker(sock)
File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 64, in
worker
outfile.flush()
IOError: [Errno 32] Broken pipe
Despite an error the job runs to completion.
The following code reproduces the error:
from pyspark.context import SparkContext
from pyspark.mllib.rand import RandomRDDs
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.regression import LabeledPoint
import random
if __name__ == "__main__":
sc = SparkContext(appName="Union bug test")
data1 = RandomRDDs.normalVectorRDD(sc,numRows=10000,numCols=200)
data1 = data1.map(lambda x: LabeledPoint(random.randint(0,1),\
DenseVector(x)))
data2 = RandomRDDs.normalVectorRDD(sc,numRows=10000,numCols=200)
data2 = data2.map(lambda x: LabeledPoint(random.randint(0,1),\
DenseVector(x)))
training_data = data1.union(data2)
#training_data = training_data.repartition(2)
model = RandomForest.trainClassifier(training_data, numClasses=2,
categoricalFeaturesInfo={},
numTrees=50, maxDepth=30)
Interestingly, re-partitioning the data after the union operation rectifies the
problem (uncomment the line before training in the code above).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]