[ 
https://issues.apache.org/jira/browse/SPARK-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092686#comment-15092686
 ] 

Joseph K. Bradley commented on SPARK-12711:
-------------------------------------------

You're right that these transformers could be UnaryTransformers.  The main 
problem is that their transform() methods involve a little initialization, 
which is not supported well by the UnaryTransformer abstraction.

Relatedly, I'm starting to work on a design doc for Params for MLlib 2.0 which 
should help handle some of these issues.  Essentially, I'm working on making it 
so that checks such as this one don't have to be implemented separately for 
each class and can be handled generically by an abstraction.  So this JIRA may 
no longer be an issue in 2.0+.
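
To make the idea of a shared check concrete, here is a minimal sketch.  It is 
not Spark code and not the planned Params design: Field, Schema, SchemaChecks 
and RemoverSketch are hypothetical, dependency-free stand-ins (Field and Schema 
loosely model StructField and StructType), used only to show the 
duplicate-column check living in one place instead of in every transformer.

```scala
// Hypothetical stand-ins for Spark's StructField / StructType (not Spark classes).
final case class Field(name: String, dataType: String)

final case class Schema(fields: Seq[Field]) {
  def fieldNames: Seq[String] = fields.map(_.name)
}

object SchemaChecks {
  // Shared check: fail fast if the output column is already present,
  // otherwise return the schema with the new column appended.
  def appendColumn(schema: Schema, outputCol: String, dataType: String): Schema = {
    require(!schema.fieldNames.contains(outputCol),
      s"Output column $outputCol already exists.")
    Schema(schema.fields :+ Field(outputCol, dataType))
  }
}

// Hypothetical transformer that relies on the shared helper instead of
// repeating the require(...) call in its own transformSchema.
class RemoverSketch(inputCol: String, outputCol: String) {
  def transformSchema(schema: Schema): Schema = {
    val inputType = schema.fields
      .find(_.name == inputCol)
      .getOrElse(throw new IllegalArgumentException(
        s"Input column $inputCol does not exist."))
      .dataType
    SchemaChecks.appendColumn(schema, outputCol, inputType)
  }
}
```

With a schema holding a single "raw" column, new RemoverSketch("raw", 
"filtered").transformSchema(schema) returns a schema with both columns, while 
new RemoverSketch("raw", "raw").transformSchema(schema) fails with an 
IllegalArgumentException raised by require.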

> ML StopWordsRemover does not protect itself from column name duplication
> ------------------------------------------------------------------------
>
>                 Key: SPARK-12711
>                 URL: https://issues.apache.org/jira/browse/SPARK-12711
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 1.6.0
>            Reporter: Grzegorz Chilkiewicz
>            Priority: Trivial
>              Labels: ml, mllib, newbie, suggestion
>
> At work we were taking a closer look at ML transformers and estimators, and I 
> spotted this anomaly.
> At first glance, the resolution looks simple:
> add the following line to StopWordsRemover.transformSchema (as is done in e.g. 
> PCA.transformSchema, StandardScaler.transformSchema, and 
> OneHotEncoder.transformSchema):
> {code}
> require(!schema.fieldNames.contains($(outputCol)),
>   s"Output column ${$(outputCol)} already exists.")
> {code}
> Am I correct? Is that a bug? If so, I am willing to prepare an 
> appropriate pull request.
> Maybe a better idea is to make use of super.transformSchema in 
> StopWordsRemover (and possibly in all other places)?
> Links to the files on GitHub mentioned above:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala#L147
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L109-L111
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala#L101-L102
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L138-L139
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala#L75-L76



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
