[
https://issues.apache.org/jira/browse/SPARK-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090535#comment-15090535
]
Wojciech Jurczyk commented on SPARK-12711:
------------------------------------------
[~josephkb] Is there any particular reason why StopWordsRemover is not a
UnaryTransformer? As the docs say, UnaryTransformer is an "Abstract class
for transformers that take one input column, apply transformation, and output
the result as a new column", which is the case here. Moreover, the
UnaryTransformer implementation checks whether the output column already
exists, so making StopWordsRemover a UnaryTransformer would solve the issue.
Speaking of UnaryTransformer candidates, I think StringIndexer is a similar
case (and there are probably other Transformers that could be
UnaryTransformers). It doesn't check whether the output column exists in the
input DataFrame (it has the same flaw). Making StringIndexer a
UnaryTransformer would fix that flaw, too. What do you think?
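To illustrate the idea, here is a minimal sketch of a base class that owns the
output-column check, so that subclasses inherit it instead of repeating it.
It does not use Spark: Field, Schema, UnaryStage and StopWordsStage are
hypothetical stand-ins invented for this example, and the type names are plain
strings. It is only meant to show the shape of the approach, not the actual
UnaryTransformer code.

```scala
// Simplified stand-ins for a schema; hypothetical, not Spark's StructType API.
final case class Field(name: String, dataType: String)
final case class Schema(fields: Vector[Field]) {
  def fieldNames: Vector[String] = fields.map(_.name)
}

// Base class that owns the schema validation, in the spirit of UnaryTransformer.
abstract class UnaryStage(val inputCol: String, val outputCol: String) {
  // Subclasses only say what type of column they produce.
  protected def outputDataType: String

  def transformSchema(schema: Schema): Schema = {
    require(schema.fieldNames.contains(inputCol),
      s"Input column $inputCol does not exist.")
    // The duplicate-name check lives here once, for every subclass.
    require(!schema.fieldNames.contains(outputCol),
      s"Output column $outputCol already exists.")
    Schema(schema.fields :+ Field(outputCol, outputDataType))
  }
}

// A subclass inherits the check without any extra code.
final class StopWordsStage(in: String, out: String) extends UnaryStage(in, out) {
  protected def outputDataType: String = "array<string>"
}
```

With this layout, calling transformSchema on a stage whose output column
name is already present in the schema fails with an IllegalArgumentException
(thrown by require), while a fresh name yields the schema with one extra field.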
> ML StopWordsRemover does not protect itself from column name duplication
> ------------------------------------------------------------------------
>
> Key: SPARK-12711
> URL: https://issues.apache.org/jira/browse/SPARK-12711
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 1.6.0
> Reporter: Grzegorz Chilkiewicz
> Priority: Trivial
> Labels: ml, mllib, newbie, suggestion
>
> At work we were 'taking a closer look' at ML transformers & estimators and I
> spotted this anomaly.
> At first glance, the resolution looks simple:
> Add the following line to StopWordsRemover.transformSchema (as is done in e.g.
> PCA.transformSchema, StandardScaler.transformSchema,
> OneHotEncoder.transformSchema):
> {code}
> require(!schema.fieldNames.contains($(outputCol)),
>   s"Output column ${$(outputCol)} already exists.")
> {code}
> Am I correct? Is that a bug? If yes - I am willing to prepare an
> appropriate pull request.
> Maybe a better idea is to make use of super.transformSchema in
> StopWordsRemover (and possibly in all other places)?
> Links to the files on GitHub mentioned above:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala#L147
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L109-L111
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala#L101-L102
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L138-L139
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala#L75-L76