[
https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weichen Xu updated SPARK-31497:
-------------------------------
Description:
PySpark CrossValidator/TrainValidationSplit with a pipeline estimator cannot
save and load the model.
Code to reproduce, run in the pyspark shell:
1) Train a model and save it in PySpark:
{code:python}
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])
# Configure an ML pipeline, which consists of three stages: tokenizer,
# hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice
# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)
cvModel.save('/tmp/cv_model001')  # saving the model fails and raises an error
{code}
2) Train a cross-validation model in Scala with code similar to the above, save
it to '/tmp/model_cv_scala001', then run the following code in PySpark:
{code:python}
from pyspark.ml.tuning import CrossValidatorModel
CrossValidatorModel.load('/tmp/model_cv_scala001')  # raises an error
{code}
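The steps above say that an error is raised but do not show its text. A minimal sketch for capturing it follows, assuming the objects and paths from steps 1 and 2. The `attempt` helper is hypothetical and not part of PySpark.

```python
import traceback


def attempt(label, action):
    """Call action() and return (label, None) on success, or
    (label, formatted traceback string) if it raises."""
    try:
        action()
    except Exception:
        return label, traceback.format_exc()
    return label, None


# Hypothetical usage in the pyspark shell, with cvModel from step 1:
#   print(attempt("save", lambda: cvModel.save('/tmp/cv_model001')))
#   print(attempt("load", lambda: CrossValidatorModel.load('/tmp/model_cv_scala001')))
```

Running both calls in one shell session would show whether the save and load failures share the same traceback.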
> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot
> save and load model
> ----------------------------------------------------------------------------------------------
>
> Key: SPARK-31497
> URL: https://issues.apache.org/jira/browse/SPARK-31497
> Project: Spark
> Issue Type: Bug
> Components: ML, PySpark
> Affects Versions: 2.4.5
> Reporter: Weichen Xu
> Priority: Major