Github user BryanCutler commented on the issue:

    https://github.com/apache/spark/pull/16770
  
    This is currently not working because of `param` issues.  For a 
model constructed from vocab to transform a DataFrame, it was first necessary 
to add `InputColumn` and `OutputColumn` params to the model class.  After that, 
the normal workflow of fitting the model and then transforming still fails 
because the CountVectorizer estimator never copies its param values to the 
CountVectorizerModel.  This causes test failures because the output column name 
is wrong on the transformed DataFrame.
    ```
    File "spark/python/pyspark/ml/feature.py", line 233, in __main__.CountVectorizer
    Failed example:
        model.transform(df).show(truncate=False)
    Expected:
        +-----+---------------+-------------------------+
        |label|raw            |vectors                  |
        +-----+---------------+-------------------------+
        |0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
        |1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
        +-----+---------------+-------------------------+
        ...
    Got:
        +-----+---------------+-------------------------------------------------+
        |label|raw            |CountVectorizerModel_4514bd7bded7359f0828__output|
        +-----+---------------+-------------------------------------------------+
        |0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])                        |
        |1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])                        |
        +-----+---------------+-------------------------------------------------+
    ```
    The correct way to fix this is to change `JavaEstimator._fit` in 
`wrapper.py` to include a call to `_copyValues` like
    ```
    def _fit(self, dataset):
        java_model = self._fit_java(dataset)
        model = self._create_model(java_model)
        # Copy the estimator's param values onto the new model before returning it
        return self._copyValues(model)
    ```
    as was done in #14653 for SPARK-10931 ("PySpark ML Models should contain 
Param values").
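
    To illustrate why the missing copy produces the wrong column name, here 
is a minimal pure-Python sketch.  It does not use PySpark: `Params`, 
`ToyEstimator`, `ToyModel` and `_copy_values` are hypothetical stand-ins that 
only mimic the idea of uid-derived defaults and of `_copyValues`, and the real 
implementation handles more cases than this.

```python
import uuid


class Params:
    """Toy stand-in for a params holder (hypothetical, heavily simplified)."""

    def __init__(self):
        self.uid = "%s_%s" % (type(self).__name__, uuid.uuid4().hex[:12])
        # Assumption: the default output column is derived from the uid,
        # which matches the column name seen in the failing doctest.
        self._defaults = {"outputCol": self.uid + "__output"}
        self._set = {}

    def set(self, name, value):
        self._set[name] = value
        return self

    def get(self, name):
        # Explicitly set values win over defaults; unknown params give None.
        return self._set.get(name, self._defaults.get(name))

    def _copy_values(self, to):
        """Copy explicitly set param values onto `to` and return it."""
        for name, value in self._set.items():
            to.set(name, value)
        return to


class ToyModel(Params):
    pass


class ToyEstimator(Params):
    def fit_without_copy(self):
        # Current behaviour: the model never sees the estimator's params.
        return ToyModel()

    def fit_with_copy(self):
        # Proposed behaviour: copy param values before returning the model.
        return self._copy_values(ToyModel())


est = ToyEstimator().set("inputCol", "raw").set("outputCol", "vectors")
broken = est.fit_without_copy()
fixed = est.fit_with_copy()

print(broken.get("outputCol"))  # a uid-derived default ending in "__output"
print(fixed.get("outputCol"))   # vectors
```

    In this sketch the model returned without the copy falls back to its own 
uid-based default, which is the same symptom as the doctest failure above.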
    
    I would like to take over SPARK-10931 and simplify it to include just the 
above fix to `wrapper.py` and implement it for the `CountVectorizer` class.  
The remaining classes can be implemented in pieces as follow-up tasks.  Once 
SPARK-10931 is resolved, this PR should work too.  What are your thoughts, 
@holdenk and @jkbradley?

