[jira] [Commented] (SPARK-32048) PySpark: error in serializing ML pipelines with training strategy and pipeline as estimator

Hyukjin Kwon (Jira) Wed, 20 Jan 2021 05:03:09 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-32048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268561#comment-17268561
 ]


Hyukjin Kwon commented on SPARK-32048:
--------------------------------------

I think the commits you pointed out would likely be ported back as it's too 
invasive. To discuss further about backporting, let's do that in the relevant 
JIRAs in the commits you pointed out.

> PySpark: error in serializing ML pipelines with training strategy and 
> pipeline as estimator
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32048
>                 URL: https://issues.apache.org/jira/browse/SPARK-32048
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 2.4.5
>            Reporter: Marcello Leida
>            Priority: Major
>
> Hi all,
> I get the following error when serializing a pipeline with a CrossValidation 
> and/or TrainValidationSplit training strategy and an estimator of type 
> Pipeline through pyspark:
> {code:java}
> AttributeError: 'Pipeline' object has no attribute 
> '_transfer_param_map_to_java
> {code}
> In scala the serialization works without problems, so i assume the issue 
> should be in pyspark
> In case of using the LinearRegression as estimator the serialization is 
> working properly.
> I see that in the tests of CrossValidation and TrainValidatioSplit, there is 
> not a test with Pipeline as an estimator.
> I do not know if there is a workaround for this or another way to serialize 
> the pipeline, or if this is a known issue
> Code for replicating the issue:
> {code:java}
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import LogisticRegression, 
> DecisionTreeClassifier
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.feature import HashingTF, Tokenizer
> from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, 
> TrainValidationSplit
> # Prepare training documents from a list of (id, text, label) tuples.
> df = spark.createDataFrame([
>     (0, "a b c d e spark", 1.0),
>     (1, "b d", 0.0),
>     (2, "spark f g h", 1.0),
>     (3, "hadoop mapreduce", 3.0)
> ], ["id", "text", "label"])
> # Configure an ML pipeline, which consists of three stages: tokenizer, 
> hashingTF, and lr.
> lr = LogisticRegression(maxIter=10, regParam=0.001)
> tokenizer = Tokenizer(inputCol="text", outputCol="words")
> hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), 
> outputCol="features", numFeatures=1000)
> #treeClassifier = DecisionTreeClassifier()
> sub_pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
> sub_pipeline2 = Pipeline(stages=[tokenizer, hashingTF])
> paramGrid = ParamGridBuilder() \
>     .addGrid(lr.regParam, [0.1, 0.01]) \
>     .build()
> pipeline_cv = CrossValidator(estimator=lr,
>                           estimatorParamMaps=paramGrid,
>                           evaluator=BinaryClassificationEvaluator(),
>                           numFolds=2)
> cvPath = "/tmp/cv"
> pipeline_cv.write().overwrite().save(cvPath)
> model = pipeline_cv.fit(sub_pipeline2.fit(df).transform(df))
> model.write().overwrite().save(cvPath)
> pipeline_cv2 = CrossValidator(estimator=sub_pipeline,
>                           estimatorParamMaps=paramGrid,
>                           evaluator=BinaryClassificationEvaluator(),
>                           numFolds=2)
> cvPath = "/tmp/cv2"
> model2 = pipeline_cv2.fit(df).bestModel
> model2.write().overwrite().save(cvPath)
> pipeline_cv2.write().overwrite().save(cvPath)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-32048) PySpark: error in serializing ML pipelines with training strategy and pipeline as estimator

Reply via email to