[ 
https://issues.apache.org/jira/browse/SPARK-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15354707#comment-15354707
 ] 

Edward Ma edited comment on SPARK-16247 at 6/29/16 7:09 AM:
------------------------------------------------------------

Thank you for your comment. Yes, it should be an ml Param. I modified the 
original as follows:

From
{noformat}
paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build()
{noformat}
To
{noformat}
paramGrid = ParamGridBuilder().addGrid(classification_model.maxDepth, (10, 20, 30)).build()
{noformat}

However, I still get an error:
{noformat}
Traceback (most recent call last):
  File "D:/workplace/dev/poc/pyspark/test_standalone_dataframe.py", line 203, 
in <module>
    cvModel = cv.fit(trainingData)
  File "D:\tool\spark-1.6.1-bin-hadoop2.6\python\pyspark\ml\pipeline.py", line 
69, in fit
    return self._fit(dataset)
  File "D:\tool\spark-1.6.1-bin-hadoop2.6\python\pyspark\ml\tuning.py", line 
241, in _fit
    metric = eva.evaluate(model.transform(validation, epm[j]))
  File "D:\tool\spark-1.6.1-bin-hadoop2.6\python\pyspark\ml\evaluation.py", 
line 69, in evaluate
    return self._evaluate(dataset)
  File "D:\tool\spark-1.6.1-bin-hadoop2.6\python\pyspark\ml\evaluation.py", 
line 99, in _evaluate
    return self._java_obj.evaluate(dataset._jdf)
  File 
"D:\tool\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py",
 line 813, in __call__
  File "D:\tool\spark-1.6.1-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
53, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column label 
must be of type DoubleType but was actually StringType.'
{noformat}

I understand the exception says the label column should be DoubleType but my 
input is StringType. However, the model should use the StringIndexer and 
VectorIndexer outputs as its input, and those are DoubleType and VectorType. 


> Using pyspark dataframe with pipeline and cross validator
> ---------------------------------------------------------
>
>                 Key: SPARK-16247
>                 URL: https://issues.apache.org/jira/browse/SPARK-16247
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.1
>            Reporter: Edward Ma
>
> I am using pyspark with a dataframe, using a pipeline to train and predict. 
> It works for a single run.
> However, I hit an issue when using the pipeline with CrossValidator. I expect 
> CrossValidator to use "indexedLabel" and "indexedMsg" as the label and 
> feature columns. Those columns are built by StringIndexer and VectorIndexer 
> and should exist after the pipeline executes. 
> Then I dug into the pyspark library [python/pyspark/ml/tuning.py] (line 222, 
> _fit function, and line 239, est.fit) and found that it does not execute the 
> pipeline stages. Therefore, I cannot get "indexedLabel" and "indexedMsg". 
> Would you mind advising whether my usage is correct?
> Thanks.
> Here is code snippet
> {noformat}
> # Indexing
> labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(extracted_data)
> featureIndexer = VectorIndexer(inputCol="extracted_msg", outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)
> # Training
> classification_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedMsg", numTrees=50, maxDepth=20)
> pipeline = Pipeline(stages=[labelIndexer, featureIndexer, classification_model])
> # Cross Validation
> paramGrid = ParamGridBuilder().addGrid(classification_model.maxDepth, (10, 20, 30)).build()
> cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
> cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=cvEvaluator, numFolds=10)
> cvModel = cv.fit(trainingData)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
