[
https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shivam Verma updated SPARK-9011:
--------------------------------
Description:
Hi
I'm new to Spark and am trying to run a grid search on a random forest (RF)
classifier for a small dataset using the pyspark.ml.tuning module, specifically
the ParamGridBuilder and CrossValidator classes. I get the following error when
I pass a DataFrame of features and labels to CrossValidator:
{noformat}
Py4JJavaError: An error occurred while calling o1464.evaluate.
: java.lang.IllegalArgumentException: requirement failed: Column rawPrediction
must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually
DoubleType.
{noformat}
I ran the following code, using the dataset from the logistic regression
example in Spark's CrossValidator documentation:
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator
For the RF, I also pass the DataFrame through a StringIndexer transformation:
{noformat}
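from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Toy dataset from the CrossValidator documentation (assumes an existing sqlContext).
dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
# Index the label column, then rename the indexed column back to "label".
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(dataset)
dataset2 = si_model.transform(dataset)
keep = [dataset2.features, dataset2.indexed]
dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label')
# predictionCol is set to "rawPrediction", the column the evaluator reads by default.
rf = RandomForestClassifier(predictionCol="rawPrediction", featuresCol="features",
                            numTrees=5, maxDepth=7)
grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset3)  # fails with the IllegalArgumentException shown above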
{noformat}
Note that the above dataset works with logistic regression. I also tried a
larger dataset with sparse vectors as features (the one I originally wanted
to fit) but received the same error with RF.
My guess is that the issue lies in how BinaryClassificationEvaluator(self,
rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC")
receives the prediction column: with LR, the rawPredictionCol is a list/vector,
whereas with RF, the prediction column is a double (I tried it out with a
single parameter). Is this an issue with the evaluator, or am I missing
something else?
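To make the suspected mismatch concrete, here is a minimal sketch in plain Python (no Spark required). It is not Spark's implementation: `check_column_type`, `evaluator_accepts`, and the dictionary schemas are hypothetical stand-ins that only mimic the kind of column-type check the error message suggests, using the column types described above.

```python
# Hypothetical, simplified stand-in for the evaluator's input check.
# Plain Python only; this is NOT Spark's actual code.

VECTOR = "VectorUDT"
DOUBLE = "DoubleType"


def check_column_type(schema, col_name, expected_type):
    """Raise ValueError if the named column does not have the expected type."""
    actual_type = schema[col_name]
    if actual_type != expected_type:
        raise ValueError(
            "requirement failed: Column %s must be of type %s "
            "but was actually %s." % (col_name, expected_type, actual_type)
        )


def evaluator_accepts(schema, raw_prediction_col="rawPrediction"):
    """Return True if the schema passes the (assumed) vector-type check."""
    try:
        check_column_type(schema, raw_prediction_col, VECTOR)
        return True
    except ValueError:
        return False


# Assumed output schema for LR: rawPrediction holds a vector.
lr_schema = {"features": VECTOR, "label": DOUBLE,
             "rawPrediction": VECTOR, "prediction": DOUBLE}

# Assumed output schema for RF built with predictionCol="rawPrediction":
# the column named rawPrediction holds a plain double.
rf_schema = {"features": VECTOR, "label": DOUBLE, "rawPrediction": DOUBLE}

print(evaluator_accepts(lr_schema))
print(evaluator_accepts(rf_schema))
```

If this reading is right, the check rejects the RF output purely because of the column's type, regardless of the dataset, which would be consistent with the same error appearing on the larger sparse dataset.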
> Issue with running CrossValidator with RandomForestClassifier on dataset
> ------------------------------------------------------------------------
>
> Key: SPARK-9011
> URL: https://issues.apache.org/jira/browse/SPARK-9011
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib, PySpark
> Affects Versions: 1.4.0
> Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on single
> node running CentOS
> Reporter: Shivam Verma
> Priority: Critical
> Labels: cross-validation, ml, mllib, pyspark, randomforest,
> tuning
>