[ https://issues.apache.org/jira/browse/SPARK-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley closed SPARK-6362.
------------------------------------
Resolution: Fixed
Fix Version/s: 1.3.0
I'm going to close this since it appears to be fixed (based on running it
locally just now on master).
> Broken pipe error when training a RandomForest on a union of two RDDs
> ---------------------------------------------------------------------
>
> Key: SPARK-6362
> URL: https://issues.apache.org/jira/browse/SPARK-6362
> Project: Spark
> Issue Type: Bug
> Components: MLlib, PySpark
> Affects Versions: 1.2.0
> Environment: Kubuntu 14.04, local driver
> Reporter: Pavel Laskov
> Priority: Minor
> Fix For: 1.3.0
>
>
> Training a RandomForest classifier on a dataset obtained as a union of two
> RDDs throws a broken pipe error:
> Traceback (most recent call last):
>   File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 162, in manager
>     code = worker(sock)
>   File "/home/laskov/code/spark-1.2.1/python/pyspark/daemon.py", line 64, in worker
>     outfile.flush()
> IOError: [Errno 32] Broken pipe
> Despite the error, the job runs to completion.
> The following code reproduces the error:
> from pyspark.context import SparkContext
> from pyspark.mllib.rand import RandomRDDs
> from pyspark.mllib.tree import RandomForest
> from pyspark.mllib.linalg import DenseVector
> from pyspark.mllib.regression import LabeledPoint
> import random
>
> if __name__ == "__main__":
>     sc = SparkContext(appName="Union bug test")
>     data1 = RandomRDDs.normalVectorRDD(sc, numRows=10000, numCols=200)
>     data1 = data1.map(lambda x: LabeledPoint(random.randint(0, 1),
>                                              DenseVector(x)))
>     data2 = RandomRDDs.normalVectorRDD(sc, numRows=10000, numCols=200)
>     data2 = data2.map(lambda x: LabeledPoint(random.randint(0, 1),
>                                              DenseVector(x)))
>     training_data = data1.union(data2)
>     # training_data = training_data.repartition(2)
>     model = RandomForest.trainClassifier(training_data, numClasses=2,
>                                          categoricalFeaturesInfo={},
>                                          numTrees=50, maxDepth=30)
> Interestingly, re-partitioning the data after the union operation rectifies
> the problem (uncomment the line before training in the code above).