Bryan Cutler created SPARK-14087:
------------------------------------

             Summary: PySpark ML JavaModel does not properly own params after being fit
                 Key: SPARK-14087
                 URL: https://issues.apache.org/jira/browse/SPARK-14087
             Project: Spark
          Issue Type: Bug
          Components: ML, PySpark
            Reporter: Bryan Cutler
            Priority: Minor


When a PySpark model is created by fitting an estimator, its UID is overwritten 
with the parent estimator's UID.  Before this assignment, the params defined on 
the model class are copied to the instance in {{Params._copy_params()}} 
and bound to the model's original UID as their parent.  As a result, PySpark 
treats the params as not owned by the model, which can lead to a 
{{ValueError}} raised from {{Params._shouldOwn()}}, such as:

{noformat}
ValueError: Param Param(parent='CountVectorizerModel_4336a81ba742b2593fef', 
name='outputCol', doc='output column name.') does not belong to 
CountVectorizer_4c8e9fd539542d783e66.
{noformat}
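
The mechanism can be reproduced without Spark. The following is a minimal sketch, not PySpark's actual code: {{MiniParam}}, {{MiniParams}}, {{MiniModel}} and the UID strings are hypothetical stand-ins that only mimic the behavior described above (params are bound to the instance's UID at construction, and the UID is overwritten afterwards).

```python
import uuid


class MiniParam(object):
    """Hypothetical stand-in for a Param: it records its owner's UID."""

    def __init__(self, parent_uid, name):
        self.parent = parent_uid
        self.name = name

    def __str__(self):
        return "%s__%s" % (self.parent, self.name)


class MiniParams(object):
    """Hypothetical stand-in for Params; only the ownership logic is modeled."""

    param_names = ()

    def __init__(self):
        self.uid = "%s_%s" % (type(self).__name__, uuid.uuid4().hex[:12])
        # Params are bound to the UID the instance has at construction time.
        self.params = [MiniParam(self.uid, n) for n in self.param_names]

    def should_own(self, param):
        # Ownership check: the param's parent must equal the current UID.
        if param.parent != self.uid:
            raise ValueError(
                "Param %s does not belong to %s." % (param, self.uid))


class MiniModel(MiniParams):
    param_names = ("inputCol", "outputCol")


estimator_uid = "MiniEstimator_0123456789ab"  # placeholder UID
model = MiniModel()
model.uid = estimator_uid  # UID overwritten after the params were bound

try:
    model.should_own(model.params[0])
    ownership_error = None
except ValueError as e:
    ownership_error = str(e)

print(ownership_error)
```

Because the params keep the UID that was generated in the constructor, the ownership check fails as soon as the UID is replaced.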

I encountered this problem while working on SPARK-13967 where I tried to add 
the shared params {{HasInputCol}} and {{HasOutputCol}} to 
{{CountVectorizerModel}}.  See the attached file feature.py for the WIP.

Using the modified {{feature.py}}, the following sample code shows the 
mismatched UIDs and produces the error above.

{noformat}
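from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import CountVectorizer

sc = SparkContext(appName="count_vec_test")
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(
    [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])], ["label", "raw"])
cv = CountVectorizer(inputCol="raw", outputCol="vectors")
model = cv.fit(df)
# The model reports the estimator's UID...
print(model.uid)
# ...but each param still carries the model's original UID as its parent.
for p in model.params:
    print(str(p))
# With the modified feature.py, this run ends in the ValueError shown above.
model.transform(df).show(truncate=False)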
{noformat}

Output (the UIDs should match, but the params keep the model's original UID):
{noformat}
CountVectorizer_4c8e9fd539542d783e66
CountVectorizerModel_4336a81ba742b2593fef__binary
CountVectorizerModel_4336a81ba742b2593fef__inputCol
CountVectorizerModel_4336a81ba742b2593fef__outputCol
{noformat}

In the Scala implementation, the model overrides the {{uid}} value, which 
the Params read when they are constructed, so they all end up with the parent 
estimator's UID.
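
One way to get the same end result on the Python side would be to re-parent the params whenever the UID is reassigned. The following is only a sketch under that assumption, not a proposed patch: {{OwnedParam}}, {{ReparentingModel}}, {{reset_uid}} and the UID strings are hypothetical names that do not exist in PySpark.

```python
class OwnedParam(object):
    """Hypothetical param that records the UID of its owner."""

    def __init__(self, parent_uid, name):
        self.parent = parent_uid
        self.name = name


class ReparentingModel(object):
    """Hypothetical model that keeps param parents in sync with its UID."""

    def __init__(self, uid, param_names):
        self.uid = uid
        self.params = [OwnedParam(uid, n) for n in param_names]

    def reset_uid(self, new_uid):
        # Update the stored UID and the parent of every param together,
        # so the ownership check keeps passing after the change.
        self.uid = new_uid
        for p in self.params:
            p.parent = new_uid
        return self

    def owns(self, param):
        return param.parent == self.uid


# Placeholder UIDs, chosen only for illustration.
model = ReparentingModel("ExampleModel_aaaa", ["inputCol", "outputCol"])
model.reset_uid("ExampleEstimator_bbbb")

for p in model.params:
    print("%s__%s" % (p.parent, p.name))
```

Updating the UID and the param parents in one step keeps the two from drifting apart, which is the property the Scala version gets from overriding the UID up front.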



