[ https://issues.apache.org/jira/browse/SPARK-9011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivam Verma updated SPARK-9011:
--------------------------------
    Description: 
Hi,

I ran into this bug while using pyspark.ml.tuning.CrossValidator with an RF 
(Random Forest) classifier to classify a small dataset. (This is a bug because 
CrossValidator works with LR (Logistic Regression) but not with RF.)

Bug:
There is an issue with how BinaryClassificationEvaluator(self, 
rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") 
interprets the 'rawPrediction' column: with LR, the rawPredictionCol is 
expected to contain vectors, whereas RF's prediction column contains doubles. 

Suggested Resolution: Either enable BinaryClassificationEvaluator to work with 
doubles, or let RF output a rawPrediction column containing probability 
vectors (with probability 1 assigned to the predicted label, and 0 assigned to 
the rest).

Detailed Observation:
While running grid search on an RF classifier to classify a small dataset 
using the pyspark.ml.tuning module (specifically the ParamGridBuilder and 
CrossValidator classes), I get the following error when I pass a DataFrame of 
features and labels to CrossValidator:
{noformat}
Py4JJavaError: An error occurred while calling o1464.evaluate.
: java.lang.IllegalArgumentException: requirement failed: Column rawPrediction 
must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually 
DoubleType.
{noformat}
I tried the following code, using the dataset given in Spark's CV documentation 
for [cross 
validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator].
 I also pass the DF through a StringIndexer transformation for the RF:
 
{noformat}
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(dataset)
dataset2 = si_model.transform(dataset)
keep = [dataset2.features, dataset2.indexed]
dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label')
# RF's prediction column holds doubles; pointing it at the name the
# evaluator expects ("rawPrediction") is what triggers the type error
rf = RandomForestClassifier(predictionCol="rawPrediction",
                            featuresCol="features", numTrees=5, maxDepth=7)
grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset3)  # raises the IllegalArgumentException above
{noformat}

Note that the above dataset *works* with logistic regression. I have also 
tried a larger dataset with sparse vectors as features (which I was originally 
trying to fit), but received the same error with RF.
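To make the asymmetry concrete, here is a plain-Python stand-in for the schema 
requirement that fails above (the function name is invented; the real check 
lives in the Scala evaluator): LR supplies a vector column and passes, RF 
supplies a double column and does not.

```python
def require_vector_column(dtype):
    # Mirrors the "requirement failed" message from the stack trace above;
    # this is an illustrative sketch, not the actual Spark implementation.
    if dtype != "VectorUDT":
        raise ValueError("requirement failed: Column rawPrediction must be "
                         "of type VectorUDT but was actually %s." % dtype)

require_vector_column("VectorUDT")  # LR's rawPrediction column: passes
# require_vector_column("DoubleType")  # RF's prediction column: raises
```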


  was:
Hi

I'm a beginner to Spark, and am trying to run grid search on an RF classifier 
to classify a small dataset using the pyspark.ml.tuning module, specifically 
the ParamGridBuilder and CrossValidator classes. I get the following error when 
I try passing a DataFrame of Features-Labels to CrossValidator:
{noformat}
Py4JJavaError: An error occurred while calling o1464.evaluate.
: java.lang.IllegalArgumentException: requirement failed: Column rawPrediction 
must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef but was actually 
DoubleType.
{noformat}
I tried the following code, using the dataset given in Spark's CV documentation 
for [cross 
validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator].
 I also pass the DF through a StringIndexer transformation for the RF:
 
{noformat}
dataset = sqlContext.createDataFrame([(Vectors.dense([0.0]), 
0.0),(Vectors.dense([0.4]), 1.0),(Vectors.dense([0.5]), 
0.0),(Vectors.dense([0.6]), 1.0),(Vectors.dense([1.0]), 1.0)] * 10,["features", 
"label"])
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(dataset)
dataset2 = si_model.transform(dataset)
keep = [dataset2.features, dataset2.indexed]
dataset3 = dataset2.select(*keep).withColumnRenamed('indexed','label')
rf = 
RandomForestClassifier(predictionCol="rawPrediction",featuresCol="features",numTrees=5,
 maxDepth=7)
grid = ParamGridBuilder().addGrid(rf.maxDepth, [4,5,6]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset3)
{noformat}

Note that the above dataset *works* on logistic regression. I have also tried a 
larger dataset with sparse vectors as features (which I was originally trying 
to fit) but received the same error on RF.

My guess is that there is an issue with how BinaryClassificationEvaluator(self, 
rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC") 
interprets the 'rawPredict' column - with LR, the rawPredictionCol is a 
list/vector, whereas with RF, the prediction column is a double. 

Is it an issue with the evaluator? Is there a workaround?


        Summary: Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent 
--> Grid search working on LR but not on RF  (was: Issue with running 
CrossValidator with RandomForestClassifier on dataset)

> Spark 1.4.0| Spark.ML Classifier Output Formats Inconsistent --> Grid search 
> working on LR but not on RF
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-9011
>                 URL: https://issues.apache.org/jira/browse/SPARK-9011
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib, PySpark
>    Affects Versions: 1.4.0
>         Environment: Spark 1.4.0 standalone on top of Hadoop 2.3 on single 
> node running CentOS
>            Reporter: Shivam Verma
>            Priority: Critical
>              Labels: cross-validation, ml, mllib, pyspark, randomforest, 
> tuning
>
> Hi,
> I ran into this bug while using pyspark.ml.tuning.CrossValidator with an RF 
> (Random Forest) classifier to classify a small dataset. (This is a bug 
> because CrossValidator works with LR (Logistic Regression) but not with RF.)
> Bug:
> There is an issue with how BinaryClassificationEvaluator(self, 
> rawPredictionCol="rawPrediction", labelCol="label", 
> metricName="areaUnderROC") interprets the 'rawPrediction' column: with LR, 
> the rawPredictionCol is expected to contain vectors, whereas RF's prediction 
> column contains doubles. 
> Suggested Resolution: Either enable BinaryClassificationEvaluator to work 
> with doubles, or let RF output a rawPrediction column containing probability 
> vectors (with probability 1 assigned to the predicted label, and 0 assigned 
> to the rest).
> Detailed Observation:
> While running grid search on an RF classifier to classify a small dataset 
> using the pyspark.ml.tuning module (specifically the ParamGridBuilder and 
> CrossValidator classes), I get the following error when I pass a DataFrame 
> of features and labels to CrossValidator:
> {noformat}
> Py4JJavaError: An error occurred while calling o1464.evaluate.
> : java.lang.IllegalArgumentException: requirement failed: Column 
> rawPrediction must be of type org.apache.spark.mllib.linalg.VectorUDT@1eef 
> but was actually DoubleType.
> {noformat}
> I tried the following code, using the dataset given in Spark's CV 
> documentation for [cross 
> validator|https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator].
>  I also pass the DF through a StringIndexer transformation for the RF:
>  
> {noformat}
> from pyspark.mllib.linalg import Vectors
> from pyspark.ml.feature import StringIndexer
> from pyspark.ml.classification import RandomForestClassifier
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
>
> dataset = sqlContext.createDataFrame(
>     [(Vectors.dense([0.0]), 0.0),
>      (Vectors.dense([0.4]), 1.0),
>      (Vectors.dense([0.5]), 0.0),
>      (Vectors.dense([0.6]), 1.0),
>      (Vectors.dense([1.0]), 1.0)] * 10,
>     ["features", "label"])
> stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
> si_model = stringIndexer.fit(dataset)
> dataset2 = si_model.transform(dataset)
> keep = [dataset2.features, dataset2.indexed]
> dataset3 = dataset2.select(*keep).withColumnRenamed('indexed', 'label')
> # RF's prediction column holds doubles; pointing it at the name the
> # evaluator expects ("rawPrediction") is what triggers the type error
> rf = RandomForestClassifier(predictionCol="rawPrediction",
>                             featuresCol="features", numTrees=5, maxDepth=7)
> grid = ParamGridBuilder().addGrid(rf.maxDepth, [4, 5, 6]).build()
> evaluator = BinaryClassificationEvaluator()
> cv = CrossValidator(estimator=rf, estimatorParamMaps=grid, evaluator=evaluator)
> cvModel = cv.fit(dataset3)  # raises the IllegalArgumentException above
> {noformat}
> Note that the above dataset *works* on logistic regression. I have also tried 
> a larger dataset with sparse vectors as features (which I was originally 
> trying to fit) but received the same error on RF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
