[jira] [Created] (SPARK-9011) Issue with running CrossValidator with RandomForestClassifier on dataset

Shivam Verma (JIRA) Mon, 13 Jul 2015 02:09:07 -0700

Shivam Verma created SPARK-9011:
-----------------------------------

             Summary: Issue with running CrossValidator with 
RandomForestClassifier on dataset
                 Key: SPARK-9011
                 URL: https://issues.apache.org/jira/browse/SPARK-9011
             Project: Spark
          Issue Type: Bug
          Components: ML, MLlib, PySpark
    Affects Versions: 1.4.0
         Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on single 
node running CentOS
            Reporter: Shivam Verma
            Priority: Critical



Hi

I'm new to Spark and am trying to run a grid search over a random forest (RF) 
classifier on a small dataset using the pyspark.ml.tuning module, specifically 
the ParamGridBuilder and CrossValidator classes. I get the following error when 
I pass a DataFrame of features and labels to CrossValidator:

Py4JJavaError: An error occurred while calling o1464.evaluate.
: java.lang.IllegalArgumentException: requirement failed: Column rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually DoubleType.

I tried the following code, using the dataset from the CrossValidator example 
in Spark's documentation (that example uses logistic regression): 
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator
For the RF, I also pass the DataFrame through a StringIndexer transformation.

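from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Toy dataset from the CrossValidator documentation example.
# sqlContext is assumed to be an existing SQLContext (e.g. from the pyspark shell).
dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])

# Index the label column, then use the indexed column as the label for the RF.
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(dataset)
dataset2 = si_model.transform(dataset)
keep = [dataset2.features, dataset2.indexed]
dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label')

# predictionCol is named "rawPrediction", the column the evaluator reads by default.
rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features",
                            numTrees=5, maxDepth=7)
grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset3)  # raises the IllegalArgumentException shown above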

Note that the same dataset works with LogisticRegression as the estimator. I 
have also tried a larger dataset with sparse feature vectors (the one I was 
originally trying to fit) and received the same error with RF.
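
For context, the size of the search in the code above can be pictured in plain 
Python, without Spark. This is only a sketch: build_param_grid is a 
hypothetical helper, not part of pyspark, and the fold count of 3 is assumed 
to be the CrossValidator default because the code does not set numFolds.

```python
from itertools import product


def build_param_grid(grid):
    """Expand {param_name: [values]} into a list of {param_name: value} maps.

    Hypothetical stand-in for what a parameter-grid builder produces;
    it is not the pyspark implementation.
    """
    names = sorted(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# Same grid as in the report: one parameter with three candidate values.
param_maps = build_param_grid({"maxDepth": [4, 5, 6]})

# Assumption: 3 folds, since the report's code does not set numFolds.
num_folds = 3

# Cross-validation trains one model per (parameter map, fold) pair.
num_fits = len(param_maps) * num_folds
```

Under these assumptions the evaluator is called once per (parameter map, fold) 
pair, so a column-type mismatch would already surface on the first evaluation.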

My guess is that the problem lies in how BinaryClassificationEvaluator(self, 
rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") 
reads the prediction column: with LR, the rawPrediction column holds a vector, 
whereas with RF, the prediction column holds a double (I tried it out with a 
single parameter setting). Is this an issue with the evaluator, or is there 
something else that I'm missing?
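
One possible reason a ranking metric such as areaUnderROC would expect 
continuous scores rather than a single predicted label can be sketched in 
plain Python, without Spark. Everything below is illustrative: area_under_roc 
is a hypothetical helper, not Spark's implementation, and the scores are made 
up for the example.

```python
def area_under_roc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Hypothetical helper for illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1.0]
    neg = [s for s, y in zip(scores, labels) if y == 0.0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


labels = [0.0, 1.0, 0.0, 1.0, 1.0]

# Made-up continuous scores, as a raw-prediction column might carry.
raw_scores = [0.1, 0.45, 0.5, 0.7, 0.9]

# The same scores collapsed to hard 0/1 predictions (threshold 0.5).
hard_preds = [1.0 if s >= 0.5 else 0.0 for s in raw_scores]

auc_from_scores = area_under_roc(raw_scores, labels)  # 5/6
auc_from_preds = area_under_roc(hard_preds, labels)   # 3.5/6
```

Collapsing the scores to labels discards the ordering within each class, so 
the two values differ. That would be consistent with the evaluator requiring a 
vector-typed rawPrediction column instead of a DoubleType prediction.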



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
