[jira] [Updated] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model

Xiangrui Meng (Jira) Sun, 26 Apr 2020 21:06:48 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Xiangrui Meng updated SPARK-31497:
----------------------------------
    Target Version/s: 3.0.0

> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model
> ----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31497
>                 URL: https://issues.apache.org/jira/browse/SPARK-31497
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 2.4.5
>            Reporter: Weichen Xu
>            Assignee: Weichen Xu
>            Priority: Major
>             Fix For: 3.0.0
>
>
> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model.
> Code to reproduce, run in the pyspark shell:
> 1) Train a model and save it in pyspark:
> {code:python}
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.feature import HashingTF, Tokenizer
> from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder
> training = spark.createDataFrame([
>     (0, "a b c d e spark", 1.0),
>     (1, "b d", 0.0),
>     (2, "spark f g h", 1.0),
>     (3, "hadoop mapreduce", 0.0),
>     (4, "b spark who", 1.0),
>     (5, "g d a y", 0.0),
>     (6, "spark fly", 1.0),
>     (7, "was mapreduce", 0.0),
>     (8, "e spark program", 1.0),
>     (9, "a e c l", 0.0),
>     (10, "spark compile", 1.0),
>     (11, "hadoop software", 0.0)
> ], ["id", "text", "label"])
> # Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
> tokenizer = Tokenizer(inputCol="text", outputCol="words")
> hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
> lr = LogisticRegression(maxIter=10)
> pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
> paramGrid = ParamGridBuilder() \
>     .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
>     .addGrid(lr.regParam, [0.1, 0.01]) \
>     .build()
> crossval = CrossValidator(estimator=pipeline,
>                           estimatorParamMaps=paramGrid,
>                           evaluator=BinaryClassificationEvaluator(),
>                           numFolds=2)  # use 3+ folds in practice
> # Run cross-validation, and choose the best set of parameters.
> cvModel = crossval.fit(training)
> cvModel.save('/tmp/cv_model001')  # saving the model fails and raises an error
> {code}
> 2) Train a cross-validation model in Scala with code similar to the above, save it to 
> '/tmp/model_cv_scala001', then run the following code in pyspark:
> {code:python}
> from pyspark.ml.tuning import CrossValidatorModel
> CrossValidatorModel.load('/tmp/model_cv_scala001')  # raises an error
> {code}
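
Both steps above exercise the same save/load round trip, so a small helper can report which half fails. The following is only a minimal sketch: the name `try_round_trip` is hypothetical and not part of PySpark, and it assumes nothing beyond the `save(path)` method and `load(path)` classmethod already used in the reproduction.

```python
def try_round_trip(model, loader, path):
    """Save `model` to `path`, then reload it with `loader.load(path)`.

    Returns (loaded_model, None) on success or (None, exception) on failure.
    `try_round_trip` is a hypothetical helper for illustration only.
    """
    try:
        model.save(path)
        return loader.load(path), None
    except Exception as exc:
        # Return the error instead of raising, so the caller can inspect it.
        return None, exc


# Hypothetical usage in the pyspark shell, after fitting cvModel as above:
#   loaded, err = try_round_trip(cvModel, CrossValidatorModel, '/tmp/cv_model001')
#   print(err if err is not None else "round trip succeeded")
```

The path should not already exist, since `save` is expected to refuse to overwrite an existing location; on an affected version `err` would presumably hold the exception described in this report.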



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

