[ https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207604#comment-15207604 ]
Apache Spark commented on SPARK-14087:
--------------------------------------
User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/11906
> PySpark ML JavaModel does not properly own params after being fit
> -----------------------------------------------------------------
>
> Key: SPARK-14087
> URL: https://issues.apache.org/jira/browse/SPARK-14087
> Project: Spark
> Issue Type: Bug
> Components: ML, PySpark
> Reporter: Bryan Cutler
> Priority: Minor
> Attachments: feature.py
>
>
> When a PySpark model is created by fitting an estimator, its UID is set to
> the parent estimator's UID. Before this assignment, however, any params
> defined on the model are copied from the class to the object in
> {{Params._copy_params()}} and are assigned the model's original, different
> UID as their parent. As a result, PySpark considers the params not owned by
> the model, which can lead to a {{ValueError}} raised from
> {{Params._shouldOwn()}}, such as:
> {noformat}
> ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef',
> name='outputCol', doc='output column name.') does not belong to
> CountVectorizer_4c8e9fd539542d783e66.
> {noformat}
> I encountered this problem while working on SPARK-13967 where I tried to add
> the shared params {{HasInputCol}} and {{HasOutputCol}} to
> {{CountVectorizerModel}}. See the attached file feature.py for the WIP.
> Using the modified feature.py, this sample code shows the mix-up in UIDs and
> produces the error above.
> {noformat}
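> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> from pyspark.ml.feature import CountVectorizer
>
> sc = SparkContext(appName="count_vec_test")
> sqlContext = SQLContext(sc)
> df = sqlContext.createDataFrame(
>     [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
>     ["label", "raw"])
> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> model = cv.fit(df)
> print(model.uid)  # UID taken from the parent estimator
> for p in model.params:
>     print(str(p))  # each param prints as <parent UID>__<name>
> model.transform(df).show(truncate=False)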
> {noformat}
> Output (the UIDs should match):
> {noformat}
> CountVectorizer_4c8e9fd539542d783e66
> CountVectorizerModel_4336a81ba742b2593fef__binary
> CountVectorizerModel_4336a81ba742b2593fef__inputCol
> CountVectorizerModel_4336a81ba742b2593fef__outputCol
> {noformat}
> In the Scala implementation, the model overrides the UID value that the
> Params read when they are constructed, so they all end up with the parent
> estimator's UID.
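
The ordering problem described above can be reproduced without Spark. The following is a minimal pure-Python sketch, not PySpark code: `Param`, `ToyModel`, `should_own`, `reset_uid`, `fit_buggy`, and `fit_fixed` are hypothetical stand-ins chosen for illustration. It assumes only what the report states: params record their parent UID when they are bound, and the ownership check compares that parent against the model's current UID. Re-parenting the params when the UID changes is one possible remedy; the linked pull request may take a different approach.

```python
import uuid


class Param:
    """Hypothetical stand-in for a param that remembers its parent's UID."""

    def __init__(self, parent, name):
        self.parent = parent  # UID of the owning object at bind time
        self.name = name

    def __str__(self):
        return f"{self.parent}__{self.name}"


class ToyModel:
    """Hypothetical model: params are bound to the UID generated in __init__."""

    param_names = ("inputCol", "outputCol")

    def __init__(self):
        self.uid = f"ToyModel_{uuid.uuid4().hex[:12]}"
        self.params = [Param(self.uid, n) for n in self.param_names]

    def should_own(self, param):
        # Ownership check: the param's parent must equal the current UID.
        if param.parent != self.uid:
            raise ValueError(f"Param {param} does not belong to {self.uid}.")

    def reset_uid(self, new_uid):
        # Change the UID and re-parent every param so ownership stays consistent.
        self.uid = new_uid
        self.params = [Param(new_uid, p.name) for p in self.params]


def fit_buggy(estimator_uid):
    model = ToyModel()
    model.uid = estimator_uid  # UID overwritten after params were bound
    return model


def fit_fixed(estimator_uid):
    model = ToyModel()
    model.reset_uid(estimator_uid)  # UID and param parents change together
    return model
```

With `fit_buggy`, calling `model.should_own(model.params[0])` raises a `ValueError` of the same shape as the one in the report, because the params still carry the model's original UID. With `fit_fixed`, the check passes and every param prints with the estimator's UID as its prefix.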