[
https://issues.apache.org/jira/browse/SPARK-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15354707#comment-15354707
]
Edward Ma edited comment on SPARK-16247 at 6/29/16 7:09 AM:
------------------------------------------------------------
Thank you for your comment. Yes, it should be an ml Param. I modified the
original as follows:
From
{noformat}
paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build()
{noformat}
To
{noformat}
paramGrid = ParamGridBuilder().addGrid(classification_model.maxDepth, (10, 20,
30)).build()
{noformat}
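As an aside, `addGrid` expects a `Param` object as its first argument (which is why the original `addGrid(1000, ...)` failed), and `build()` expands all grids as a cartesian product of candidate values. A minimal pure-Python sketch of that expansion (illustrative only, not pyspark's implementation):

```python
from itertools import product

def build_param_grid(grids):
    """Expand {param_name: candidate_values} into a list of param maps,
    mimicking ParamGridBuilder.build(): one map per combination in the
    cartesian product of all candidate value lists."""
    names = sorted(grids)
    value_lists = [grids[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]

# One grid over maxDepth yields three param maps, one per candidate value.
grid = build_param_grid({"maxDepth": (10, 20, 30)})
```

With two grids, e.g. `maxDepth` and `numTrees`, the product yields one param map per combination, so CrossValidator fits the pipeline once per map and fold.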
However, I still get an error:
{noformat}
Traceback (most recent call last):
File "D:/workplace/dev/poc/pyspark/test_standalone_dataframe.py", line 203,
in <module>
cvModel = cv.fit(trainingData)
File "D:\tool\spark-1.6.1-bin-hadoop2.6\python\pyspark\ml\pipeline.py", line
69, in fit
return self._fit(dataset)
File "D:\tool\spark-1.6.1-bin-hadoop2.6\python\pyspark\ml\tuning.py", line
241, in _fit
metric = eva.evaluate(model.transform(validation, epm[j]))
File "D:\tool\spark-1.6.1-bin-hadoop2.6\python\pyspark\ml\evaluation.py",
line 69, in evaluate
return self._evaluate(dataset)
File "D:\tool\spark-1.6.1-bin-hadoop2.6\python\pyspark\ml\evaluation.py",
line 99, in _evaluate
return self._java_obj.evaluate(dataset._jdf)
File
"D:\tool\spark-1.6.1-bin-hadoop2.6\python\lib\py4j-0.9-src.zip\py4j\java_gateway.py",
line 813, in __call__
File "D:\tool\spark-1.6.1-bin-hadoop2.6\python\pyspark\sql\utils.py", line
53, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column label
must be of type DoubleType but was actually StringType.'
{noformat}
I understand that the exception says the label column should be DoubleType
while my input is StringType. However, the model should take the StringIndexer
and VectorIndexer outputs as its inputs, which are DoubleType and VectorType.
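One likely cause worth checking: MulticlassClassificationEvaluator's `labelCol` defaults to "label", i.e. the raw string column, so even though the pipeline produces "indexedLabel", the evaluator never looks at it; constructing the evaluator with `labelCol="indexedLabel"` may resolve the type error. For context, StringIndexer assigns each distinct label a double index ordered by descending frequency, which is why its output satisfies the DoubleType requirement. A rough pure-Python sketch of that mapping (illustrative only, not pyspark's code; ties are broken alphabetically here, which is this sketch's choice):

```python
from collections import Counter

def string_index(labels):
    """Map string labels to float indices by descending frequency,
    roughly what a fitted StringIndexer model does: the most frequent
    label gets 0.0, the next 1.0, and so on."""
    by_freq = sorted(Counter(labels).items(), key=lambda kv: (-kv[1], kv[0]))
    mapping = {label: float(i) for i, (label, _) in enumerate(by_freq)}
    return [mapping[label] for label in labels]

# "spam" is most frequent, so it maps to 0.0; the result is all doubles.
indexed = string_index(["spam", "ham", "spam", "eggs"])
```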
> Using pyspark dataframe with pipeline and cross validator
> ---------------------------------------------------------
>
> Key: SPARK-16247
> URL: https://issues.apache.org/jira/browse/SPARK-16247
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 1.6.1
> Reporter: Edward Ma
>
> I am using pyspark with DataFrames and a Pipeline to train and predict. A
> single run works fine.
> However, I hit an issue when combining the Pipeline with CrossValidator. I
> expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label
> and feature columns. Those columns are produced by StringIndexer and
> VectorIndexer, so they should exist after the pipeline executes.
> I then dug into the pyspark library [python/pyspark/ml/tuning.py] (line 222,
> _fit function and line 239, est.fit) and found that it does not appear to
> execute the pipeline stages, so I cannot get "indexedLabel" and "indexedMsg".
> Would you mind advising whether my usage is correct?
> Thanks.
> Here is a code snippet:
> Here is code snippet
> {noformat}
> # Indexing
> labelIndexer = StringIndexer(inputCol="label",
> outputCol="indexedLabel").fit(extracted_data)
> featureIndexer = VectorIndexer(inputCol="extracted_msg",
> outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)
> # Training
> classification_model = RandomForestClassifier(labelCol="indexedLabel",
> featuresCol="indexedMsg", numTrees=50, maxDepth=20)
> pipeline = Pipeline(stages=[labelIndexer, featureIndexer,
> classification_model])
> # Cross Validation
> paramGrid = ParamGridBuilder().addGrid(classification_model.maxDepth, (10,
> 20, 30)).build()
> cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
> cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid,
> evaluator=cvEvaluator, numFolds=10)
> cvModel = cv.fit(trainingData)
> {noformat}
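For what it's worth, the `est.fit(train, epm[j])` call referenced above does fit the full estimator (here, the whole Pipeline, including the indexer stages) on each fold's training split, so the indexed columns exist on the transformed validation data; the evaluator just has to be pointed at them. A schematic of that k-fold loop in pure Python (not pyspark's actual implementation; `fit` and `evaluate` stand in for the estimator and evaluator):

```python
import random

def cross_validate(fit, evaluate, rows, param_maps, num_folds=3, seed=0):
    """Schematic of CrossValidator._fit: for each fold, fit the whole
    estimator (pipeline) on the training split once per param map, then
    evaluate on the held-out split; return the best param map and the
    per-map average metrics."""
    rnd = random.Random(seed)
    fold_of = [rnd.randrange(num_folds) for _ in rows]
    metrics = [0.0] * len(param_maps)
    for fold in range(num_folds):
        train = [r for r, f in zip(rows, fold_of) if f != fold]
        held_out = [r for r, f in zip(rows, fold_of) if f == fold]
        for j, params in enumerate(param_maps):
            model = fit(train, params)              # fits every pipeline stage
            metrics[j] += evaluate(model, held_out)  # sees transformed columns
    best = max(range(len(param_maps)), key=lambda j: metrics[j])
    return param_maps[best], [m / num_folds for m in metrics]
```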
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)