[
https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiangrui Meng updated SPARK-31497:
----------------------------------
Target Version/s: 3.0.0
> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot
> save and load model
> ----------------------------------------------------------------------------------------------
>
> Key: SPARK-31497
> URL: https://issues.apache.org/jira/browse/SPARK-31497
> Project: Spark
> Issue Type: Bug
> Components: ML, PySpark
> Affects Versions: 2.4.5
> Reporter: Weichen Xu
> Assignee: Weichen Xu
> Priority: Major
> Fix For: 3.0.0
>
>
> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot
> save and load model.
> Reproduction code, run in the pyspark shell:
> 1) Train a model and save it in pyspark:
> {code:python}
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.feature import HashingTF, Tokenizer
> from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder
> training = spark.createDataFrame([
>     (0, "a b c d e spark", 1.0),
>     (1, "b d", 0.0),
>     (2, "spark f g h", 1.0),
>     (3, "hadoop mapreduce", 0.0),
>     (4, "b spark who", 1.0),
>     (5, "g d a y", 0.0),
>     (6, "spark fly", 1.0),
>     (7, "was mapreduce", 0.0),
>     (8, "e spark program", 1.0),
>     (9, "a e c l", 0.0),
>     (10, "spark compile", 1.0),
>     (11, "hadoop software", 0.0)
> ], ["id", "text", "label"])
> # Configure an ML pipeline, which consists of three stages: tokenizer,
> # hashingTF, and lr.
> tokenizer = Tokenizer(inputCol="text", outputCol="words")
> hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
> lr = LogisticRegression(maxIter=10)
> pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
> paramGrid = ParamGridBuilder() \
>     .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
>     .addGrid(lr.regParam, [0.1, 0.01]) \
>     .build()
> crossval = CrossValidator(estimator=pipeline,
>                           estimatorParamMaps=paramGrid,
>                           evaluator=BinaryClassificationEvaluator(),
>                           numFolds=2)  # use 3+ folds in practice
> # Run cross-validation, and choose the best set of parameters.
> cvModel = crossval.fit(training)
> cvModel.save('/tmp/cv_model001')  # saving the model fails and raises an error
> {code}
> 2) Train a cross-validation model in Scala with code similar to the above, save it to
> '/tmp/model_cv_scala001', then run the following code in pyspark:
> {code:python}
> from pyspark.ml.tuning import CrossValidatorModel
> CrossValidatorModel.load('/tmp/model_cv_scala001')  # raises an error
> {code}
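Both reproduction steps come down to the same question: does a save followed by a load complete without raising? A minimal sketch of a helper that captures this is given below. The name `check_round_trip` and its signature are hypothetical (not part of PySpark); it only assumes the standard `write().overwrite().save(path)` and `load(path)` calls that PySpark's writable and readable ML classes expose.

```python
def check_round_trip(model, loader, path):
    """Try to save `model` to `path` and load it back through `loader`.

    Hypothetical diagnostic helper, not part of PySpark.
    Returns None if both steps succeed, otherwise the exception raised.
    """
    try:
        # Writable ML objects expose write(); overwrite() replaces any
        # output left at `path` by an earlier attempt.
        model.write().overwrite().save(path)
        # `loader` is the class used to read the model back,
        # for example CrossValidatorModel.
        loader.load(path)
    except Exception as exc:  # broad on purpose: the goal is only to report
        return exc
    return None
```

In the pyspark shell, after step 1, this could be called as `check_round_trip(cvModel, CrossValidatorModel, '/tmp/cv_model001')`. On an affected version the call would be expected to return the error described in this issue rather than `None`.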