[ 
https://issues.apache.org/jira/browse/SPARK-14087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15207604#comment-15207604
 ] 

Apache Spark commented on SPARK-14087:
--------------------------------------

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/11906

> PySpark ML JavaModel does not properly own params after being fit
> -----------------------------------------------------------------
>
>                 Key: SPARK-14087
>                 URL: https://issues.apache.org/jira/browse/SPARK-14087
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>            Reporter: Bryan Cutler
>            Priority: Minor
>         Attachments: feature.py
>
>
> When a PySpark model is created by fitting data, its UID is set to the 
> parent estimator's UID.  Before this assignment, any params defined on the 
> model class are copied to the model instance in {{Params._copy_params()}} 
> and given the model's original UID as their parent.  PySpark then treats the 
> params as not owned by the model, which can lead to a {{ValueError}} raised 
> from {{Params._shouldOwn()}}, such as:
> {noformat}
> ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', 
> name='outputCol', doc='output column name.') does not belong to 
> CountVectorizer_4c8e9fd539542d783e66.
> {noformat}
> I encountered this problem while working on SPARK-13967 where I tried to add 
> the shared params {{HasInputCol}} and {{HasOutputCol}} to 
> {{CountVectorizerModel}}.  See the attached file feature.py for the WIP.
> Using the modified 'feature.py', this sample code shows the mixup in UIDs and 
> produces the error above.
> {noformat}
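> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> from pyspark.ml.feature import CountVectorizer
>
> sc = SparkContext(appName="count_vec_test")
> sqlContext = SQLContext(sc)
> df = sqlContext.createDataFrame(
>     [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
>     ["label", "raw"])
> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> model = cv.fit(df)
> print(model.uid)  # UID of the fitted model
> for p in model.params:
>     print(str(p))  # each param prints as <parent UID>__<name>
> model.transform(df).show(truncate=False)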
> {noformat}
> Output (the model UID on the first line and the params' parent UIDs should match):
> {noformat}
> CountVectorizer_4c8e9fd539542d783e66
> CountVectorizerModel_4336a81ba742b2593fef__binary
> CountVectorizerModel_4336a81ba742b2593fef__inputCol
> CountVectorizerModel_4336a81ba742b2593fef__outputCol
> {noformat}
> In the Scala implementation, the model overrides the UID value that the 
> Params use when they are constructed, so they all end up with the parent 
> estimator's UID.
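
The ownership mismatch described above can be reproduced without Spark. The following is a minimal sketch in plain Python, not the actual PySpark code: the classes `Param`, `Params`, `ToyEstimator` and `ToyModel`, and the methods `should_own` and `reset_uid`, are hypothetical stand-ins chosen for illustration. It assumes only what the report states, namely that params record their owner's UID when they are created and that the model's UID is overwritten afterwards. The `reset_uid` helper shows one possible remedy (re-parenting the params together with the UID) and is not necessarily what the linked pull request does.

```python
import uuid


class Param:
    """Simplified stand-in for a param: it records the UID of its owner."""

    def __init__(self, parent_uid, name):
        self.parent = parent_uid
        self.name = name

    def __str__(self):
        return f"{self.parent}__{self.name}"


class Params:
    """Toy owner of params; NOT the real pyspark.ml.param.Params."""

    param_names = ()

    def __init__(self):
        self.uid = f"{type(self).__name__}_{uuid.uuid4().hex[:12]}"
        # Params are bound to whatever the UID is at construction time.
        self.params = [Param(self.uid, n) for n in self.param_names]

    def should_own(self, param):
        """Raise ValueError if the param's parent UID differs from ours."""
        if param.parent != self.uid:
            raise ValueError(f"Param {param} does not belong to {self.uid}.")

    def reset_uid(self, new_uid):
        """Hypothetical remedy: change the UID and re-parent every param."""
        self.uid = new_uid
        for p in self.params:
            p.parent = new_uid


class ToyEstimator(Params):
    param_names = ("inputCol", "outputCol")


class ToyModel(Params):
    param_names = ("inputCol", "outputCol")


estimator = ToyEstimator()

# Problem case: the params already carry the model's own UID,
# and only afterwards is the UID overwritten with the estimator's.
buggy = ToyModel()
buggy.uid = estimator.uid
try:
    buggy.should_own(buggy.params[0])
    buggy_error = None
except ValueError as exc:
    buggy_error = str(exc)
print(buggy_error)

# Re-parenting the params together with the UID keeps ownership consistent.
fixed = ToyModel()
fixed.reset_uid(estimator.uid)
fixed.should_own(fixed.params[0])  # does not raise
print([str(p) for p in fixed.params])
```

In this sketch the first check fails with a message of the same shape as the reported `ValueError`, while the second passes because every param's parent matches the model's UID.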



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
