[ https://issues.apache.org/jira/browse/SPARK-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089854#comment-15089854 ]
Joseph K. Bradley commented on SPARK-12711:
-------------------------------------------

You're correct that it should prevent column name duplication. I'd recommend using SchemaUtils.appendColumn:
[https://github.com/apache/spark/blob/00d9261724feb48d358679efbae6889833e893e0/mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala#L54]

It would be great if you could send a PR to fix this. Thanks!

By the way, please don't set the target version; committers or component maintainers will set that.

> ML StopWordsRemover does not protect itself from column name duplication
> ------------------------------------------------------------------------
>
>                 Key: SPARK-12711
>                 URL: https://issues.apache.org/jira/browse/SPARK-12711
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 1.6.0
>            Reporter: Grzegorz Chilkiewicz
>            Priority: Trivial
>              Labels: ml, mllib, newbie, suggestion
>
> At work we were 'taking a closer look' at ML transformers & estimators and I
> spotted this anomaly.
> At first glance, the resolution looks simple:
> add the following line to StopWordsRemover.transformSchema (as is done in e.g.
> PCA.transformSchema, StandardScaler.transformSchema, and
> OneHotEncoder.transformSchema):
> {code}
> require(!schema.fieldNames.contains($(outputCol)), s"Output column ${$(outputCol)} already exists.")
> {code}
> Am I correct? Is that a bug? If yes, I am willing to prepare an
> appropriate pull request.
> Maybe a better idea is to make use of super.transformSchema in
> StopWordsRemover (and possibly in all other places)?
>
> Links to files at github, mentioned above:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala#L147
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L109-L111
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala#L101-L102
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L138-L139
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala#L75-L76
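
For readers following this thread, here is a rough, self-contained sketch of the kind of check being discussed: a helper that appends an output column to a schema and refuses to do so if the name is already taken. It does not use Spark at all. Field, Schema, and SchemaSketch are hypothetical stand-ins invented for this illustration (they are not Spark's StructField, StructType, or SchemaUtils), and the error message wording is an assumption, so treat it as a sketch of the idea rather than the actual implementation.

```scala
// Hypothetical, simplified stand-ins for a schema and its fields.
// These are NOT Spark classes; they exist only for this illustration.
final case class Field(name: String, dataType: String, nullable: Boolean = false)

final case class Schema(fields: Vector[Field]) {
  // Names of all columns currently in the schema, in order.
  def fieldNames: Vector[String] = fields.map(_.name)
}

object SchemaSketch {
  /**
   * Returns a new schema with `colName` appended.
   * Throws IllegalArgumentException (via `require`) if a column
   * with that name already exists, so duplicates are rejected early.
   */
  def appendColumn(schema: Schema, colName: String, dataType: String): Schema = {
    require(!schema.fieldNames.contains(colName), s"Column $colName already exists.")
    Schema(schema.fields :+ Field(colName, dataType))
  }
}
```

With a schema containing only a column named "raw", appending "filtered" succeeds, while appending "raw" again fails the require check. Centralising the check in one helper, rather than repeating the require line in every transformSchema, is the design choice suggested in the comment above.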