[ 
https://issues.apache.org/jira/browse/SPARK-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koert kuipers closed SPARK-8817.
--------------------------------
    Resolution: Not A Problem

I believe the community disagrees with me and thinks it's OK to have duplicate 
names, so I am going to close this.

> DataFrame should not allow duplicate column names
> -------------------------------------------------
>
>                 Key: SPARK-8817
>                 URL: https://issues.apache.org/jira/browse/SPARK-8817
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>            Reporter: koert kuipers
>            Priority: Minor
>
> Pull request 2209 (https://github.com/apache/spark/pull/2209) for SPARK-2890 
> disabled field name validation (which checks for duplicate column names) in 
> StructType, in favor of throwing an error in SQL query analysis.
> The problem with this is that it is not intuitive for a DataFrame to have 
> duplicate column names, and not all usage of DataFrame involves SQL queries.
> By removing the check from StructType, and hence from DataFrame, it becomes the 
> responsibility of the DSLs that are built on top of DataFrame to do these 
> checks, which is more burdensome and can lead to subtle errors. I ran into 
> this while writing an alternative DSL for DataFrame.
> In R duplicate columns get automatically renamed:
> > data.frame(x = c(1,2), x = c(3,4))
>   x x.1
> 1 1   3
> 2 2   4
> I believe pandas does allow duplicate names, but I am not sure (I have never 
> used it).
> Maybe StructType.validateFields could do something similar to what R does and 
> simply rename the dupes?
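
To make the renaming idea above concrete, here is a minimal sketch in plain
Python of what a DSL built on top of DataFrame might do with a list of column
names. The helpers find_duplicates and dedupe_names are hypothetical names
invented for this illustration; they are not part of Spark, and the suffix
scheme only imitates the x, x.1 output of the R example rather than
reproducing R's exact algorithm.

```python
from collections import Counter


def find_duplicates(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    seen = set()
    dupes = []
    for name in names:
        if counts[name] > 1 and name not in seen:
            seen.add(name)
            dupes.append(name)
    return dupes


def dedupe_names(names):
    """Rename repeated names with numeric suffixes, e.g. x, x -> x, x.1.

    Hypothetical helper: every returned name is unique, and the first
    occurrence of each name is left unchanged.
    """
    taken = set()        # names already emitted
    last_suffix = {}     # last suffix tried for each original name
    result = []
    for name in names:
        candidate = name
        n = last_suffix.get(name, 0)
        # Bump the suffix until the candidate does not clash with
        # anything emitted so far.
        while candidate in taken:
            n += 1
            candidate = f"{name}.{n}"
        last_suffix[name] = n
        taken.add(candidate)
        result.append(candidate)
    return result


if __name__ == "__main__":
    cols = ["x", "x"]
    print(find_duplicates(cols))
    print(dedupe_names(cols))
```

A caller that prefers strict validation could raise an error whenever
find_duplicates returns a non-empty list, instead of renaming.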



