[jira] [Updated] (SPARK-12874) ML StringIndexer does not protect itself from column name duplication

Josh Rosen (JIRA) Fri, 26 Feb 2016 20:04:56 -0800

     [ https://issues.apache.org/jira/browse/SPARK-12874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Josh Rosen updated SPARK-12874:
-------------------------------
    Fix Version/s:     (was: 1.6.2)
                   1.6.1

> ML StringIndexer does not protect itself from column name duplication
> ---------------------------------------------------------------------
>
>                 Key: SPARK-12874
>                 URL: https://issues.apache.org/jira/browse/SPARK-12874
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.5.2, 1.6.0
>            Reporter: Wojciech Jurczyk
>            Assignee: Yu Ishikawa
>             Fix For: 1.6.1, 2.0.0
>
>
> StringIndexerModel does not check the schema of the input DataFrame when 
> performing transform(). As a result, it is possible to create a DataFrame 
> containing columns with duplicate names.
> This issue is similar to SPARK-12711. StringIndexer could use 
> transformSchema to ensure that the input DataFrame schema is consistent 
> with the parameter values.
> Please confirm, and I will then prepare a PR to resolve the bug.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L147
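
The kind of check the description asks for can be sketched without Spark. The following is a minimal, dependency-free illustration, not the actual Spark implementation or the eventual fix: the schema is modeled as a plain list of column names, and the function name `transform_schema` and its parameters are hypothetical stand-ins for the StringIndexer input and output column parameters.

```python
def transform_schema(input_columns, input_col, output_col):
    """Validate column parameters against a schema and return the output schema.

    input_columns: list of column names in the input DataFrame (hypothetical model).
    input_col:     name of the column to index; must already exist.
    output_col:    name of the column to add; must not already exist.
    """
    if input_col not in input_columns:
        raise ValueError(f"Input column '{input_col}' does not exist.")
    if output_col in input_columns:
        # This is the missing guard described in the issue: without it,
        # the result would contain two columns with the same name.
        raise ValueError(f"Output column '{output_col}' already exists.")
    return list(input_columns) + [output_col]


if __name__ == "__main__":
    # Valid parameters: the new column is appended to the schema.
    print(transform_schema(["id", "category"], "category", "categoryIndex"))

    # Conflicting parameters: the output name duplicates an existing column.
    try:
        transform_schema(["id", "category"], "category", "id")
    except ValueError as err:
        print("rejected:", err)
```

Running the validation before the transformation means a conflicting output column name fails fast with a clear error instead of silently producing a DataFrame with duplicate column names.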



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

