[
https://issues.apache.org/jira/browse/SPARK-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
koert kuipers closed SPARK-8817.
--------------------------------
Resolution: Not A Problem
I believe the community disagrees with me and thinks it's OK to have duplicate
names, so I am going to close this.
> DataFrame should not allow duplicate column names
> -------------------------------------------------
>
> Key: SPARK-8817
> URL: https://issues.apache.org/jira/browse/SPARK-8817
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.0
> Reporter: koert kuipers
> Priority: Minor
>
> Pull 2209 (https://github.com/apache/spark/pull/2209) for SPARK-2890 disabled
> field name validation (which checks for duplicate column names) in
> StructType, in favor of throwing an error in SQL query analysis.
> The problem with this is that it is not intuitive for a DataFrame to have
> duplicate column names, and not all usage of DataFrame involves SQL queries.
> By removing the check from StructType, and hence from DataFrame, it becomes
> the responsibility of the DSLs that are built on top of DataFrame to do these
> checks, which is more burdensome and can lead to subtle errors. I ran into
> this while writing an alternative DSL for DataFrame.
> In R duplicate columns get automatically renamed:
> > data.frame(x = c(1,2), x = c(3,4))
>   x x.1
> 1 1   3
> 2 2   4
> I believe pandas does allow duplicate names, but I am not sure (I have never
> used it).
> Maybe StructType.validateFields could do something similar to what R does and
> simply rename the duplicates?
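
The renaming idea suggested above can be sketched roughly as follows. This is
only an illustration in plain Python, not Spark code and not a reproduction of
R's exact algorithm: the function name rename_duplicates is hypothetical, and
the ".1", ".2" suffix scheme is an assumption modeled on the R output quoted
above.

```python
from typing import Iterable, List


def rename_duplicates(names: Iterable[str]) -> List[str]:
    """Return the names in order, with repeats suffixed .1, .2, ...

    Hypothetical helper: the first occurrence of a name is kept as-is and
    later occurrences get the smallest free numeric suffix.
    """
    names = list(names)
    taken = set(names)   # reserve every original name so suffixes never collide
    seen = set()         # names already emitted
    next_suffix = {}     # next suffix to try for each repeated base name
    result = []
    for name in names:
        if name not in seen:
            # First occurrence: keep the name unchanged.
            seen.add(name)
            result.append(name)
            continue
        # Repeat: find the first suffix that is not already in use.
        n = next_suffix.get(name, 1)
        candidate = f"{name}.{n}"
        while candidate in taken:
            n += 1
            candidate = f"{name}.{n}"
        next_suffix[name] = n + 1
        taken.add(candidate)
        seen.add(candidate)
        result.append(candidate)
    return result
```

With this sketch, rename_duplicates(["x", "x"]) returns ["x", "x.1"], matching
the column names in the R example. Reserving all original names up front means
an input such as ["x", "x", "x.1"] still yields unique names.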