[ 
https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625580#comment-14625580
 ] 

Joseph K. Bradley commented on SPARK-9011:
------------------------------------------

This isn't really a bug, but rather a missing feature.  There are a few 
solutions:
* Add more metrics to BinaryClassificationEvaluator, and possibly permit it to 
take a Double column.
* Have trees & ensembles extend Classifier [SPARK-9016]

The outputs are not inconsistent, but some classifiers can output more columns 
than others.  (The second bullet above addresses this.)

We're aiming to get at least the second fix into 1.5.  I hope that the first 
one will get in as well.

I'll close this.
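
To illustrate the first bullet: the sketch below is plain Python, not Spark's 
implementation, and the function name area_under_roc is hypothetical. It 
assumes binary labels encoded as 0.0 and 1.0 and shows that one double-valued 
score per row is enough to compute an area under the ROC curve, using the 
pairwise (rank-sum) formulation with ties counted as one half.

```python
def area_under_roc(scores, labels):
    """Area under the ROC curve from one double-valued score per row.

    Hypothetical helper for illustration only. It counts, over all
    (positive, negative) pairs, how often the positive row scores higher;
    ties count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1.0]
    negatives = [s for s, y in zip(scores, labels) if y == 0.0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Labels from the small dataset in the report, with hard 0/1 predictions
# (made up here) standing in for a Double prediction column.
labels = [0.0, 1.0, 0.0, 1.0, 1.0]
hard_predictions = [0.0, 0.0, 0.0, 1.0, 1.0]
auc = area_under_roc(hard_predictions, labels)
```

With hard 0/1 predictions the curve has only one interior point, so the 
resulting value is coarser than one computed from probabilities or raw scores.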

> Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent --> Grid search 
> working on LR but not on RF
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-9011
>                 URL: https://issues.apache.org/jira/browse/SPARK-9011
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, PySpark
>    Affects Versions: 1.4.0
>         Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on single 
> node running CentOS
>            Reporter: Shivam Verma
>            Priority: Minor
>              Labels: cross-validation, ml, mllib, pyspark, randomforest, 
> tuning
>
> Hi,
> I ran into this bug while using pyspark.ml.tuning.CrossValidator to tune an 
> RF (Random Forest) classifier on a small dataset. (I consider this a bug 
> because CrossValidator works on LR (Logistic Regression) but not on RF.)
> Bug:
> There is an issue with how BinaryClassificationEvaluator(self, 
> rawPredictionCol="rawPrediction", labelCol="label", 
> metricName="areaUnderROC") interprets the rawPrediction column: with LR, the 
> rawPredictionCol contains vectors, as the evaluator expects, whereas with RF, 
> the prediction column contains doubles.
> Suggested Resolution: Either enable BinaryClassificationEvaluator to work 
> with doubles, or let RF output a rawPrediction column containing probability 
> vectors (with a probability of 1 assigned to the predicted label and 0 
> assigned to the rest).
> Detailed Observation:
> While running grid search on an RF classifier to classify a small dataset 
> using the pyspark.ml.tuning module, specifically the ParamGridBuilder and 
> CrossValidator classes, I get the following error when I pass a DataFrame of 
> features and labels to CrossValidator:
> {noformat}
> Py4JJavaError: An error occurred while calling o1464.evaluate.
> : java.lang.IllegalArgumentException: requirement failed: Column 
> rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef 
> but was actually DoubleType.
> {noformat}
> I tried the following code, using the dataset given in Spark's documentation 
> for 
> [CrossValidator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator].
> I also pass the DF through a StringIndexer transformation for the RF:
>  
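> {noformat}
> from pyspark.ml.classification import RandomForestClassifier
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.feature import StringIndexer
> from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
> from pyspark.mllib.linalg import Vectors
>
> # sqlContext: an existing SQLContext (assumed, e.g. the one the pyspark
> # shell provides).
> dataset = sqlContext.createDataFrame(
>     [(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0),
>      (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0),
>      (Vectors.dense([1.0]), 1.0)] * 10,
>     ["features", "label"])
> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
> si_model = stringIndexer.fit(dataset)
> dataset2 = si_model.transform(dataset)
> keep = [dataset2.features, dataset2.indexed]
> dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label')
> # The RF prediction column (doubles in 1.4.0) is named "rawPrediction" so
> # that the evaluator's default rawPredictionCol picks it up.
> rf = RandomForestClassifier(predictionCol="rawPrediction",
>                             featuresCol="features", numTrees=5, maxDepth=7)
> grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=rf, estimatorParamMaps=grid,
>                     evaluator=evaluator)
> cvModel = cv.fit(dataset3)  # fails with the Py4JJavaError shown above
> {noformat}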
> Note that the above dataset *works* on logistic regression. I have also tried 
> a larger dataset with sparse vectors as features (which I was originally 
> trying to fit) but received the same error on RF.
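
The second suggested resolution in the report (a vector with 1 at the 
predicted label and 0 elsewhere) can be sketched in plain Python as follows. 
This is only an illustration under stated assumptions, not how Spark builds 
rawPrediction: the helper name one_hot_raw_prediction is hypothetical, and 
labels are assumed to be doubles indexed from 0.0, as StringIndexer produces.

```python
def one_hot_raw_prediction(prediction, num_classes):
    """Turn a predicted label (a double such as 1.0) into a one-hot list.

    Hypothetical helper for illustration: position `prediction` gets 1.0
    and every other position gets 0.0.
    """
    index = int(prediction)
    if index != prediction or not 0 <= index < num_classes:
        raise ValueError("prediction must be a class index in [0, num_classes)")
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


# Made-up Double predictions for a binary problem.
predictions = [0.0, 1.0, 1.0]
raw_predictions = [one_hot_raw_prediction(p, 2) for p in predictions]
```

Such one-hot vectors carry no more ranking information than the doubles they 
are built from, so they satisfy the evaluator's column type without adding 
real class scores.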



