Aseem Bansal created SPARK-16896:
------------------------------------

Summary: Loading csv with duplicate column names
Key: SPARK-16896
URL: https://issues.apache.org/jira/browse/SPARK-16896
Project: Spark
Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Aseem Bansal

It would be great if the library allowed us to load CSV files with duplicate column names. I understand that having duplicate columns in the data is odd, but upstream data sometimes arrives that way. We may choose to ignore the duplicates, but currently there is no way to drop them because we cannot load the file at all.

As a pre-processing workaround, I currently load the data into R, change the column names, and write out a fixed version that the Spark Java API can work with. As for other options, R's read.csv handles this situation automatically by appending a number to the column name.

Case sensitivity in column names can also cause problems. If we have columns like ColumnName and columnName, I may want to keep them as separate columns, but the option to do this is not documented.
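
To make the requested behaviour concrete, here is a minimal sketch of the renaming step in plain Python. It is not Spark code and not how Spark or R implement this: the function name `dedupe_column_names`, the `case_sensitive` flag, and the `.1`, `.2` suffix format are all assumptions chosen for illustration. It keeps the first occurrence of each name and appends a number to every later repeat.

```python
from collections import Counter


def dedupe_column_names(names, case_sensitive=True):
    """Return a new list in which repeated column names get a numeric suffix.

    Hypothetical helper for illustration only. With case_sensitive=False,
    names that differ only by case (ColumnName, columnName) count as repeats.
    """
    def key(name):
        return name if case_sensitive else name.lower()

    taken = {key(n) for n in names}  # every name already present in the header
    seen = Counter()                 # next suffix to try for each name
    result = []
    for name in names:
        k = key(name)
        if seen[k] == 0:
            # First occurrence: keep the name unchanged.
            seen[k] = 1
            result.append(name)
            continue
        # Later occurrence: find a suffix that does not clash with any
        # existing or previously generated name.
        suffix = seen[k]
        candidate = f"{name}.{suffix}"
        while key(candidate) in taken:
            suffix += 1
            candidate = f"{name}.{suffix}"
        taken.add(key(candidate))
        seen[k] = suffix + 1
        result.append(candidate)
    return result


if __name__ == "__main__":
    print(dedupe_column_names(["id", "name", "id", "id"]))
    print(dedupe_column_names(["ColumnName", "columnName"]))
    print(dedupe_column_names(["ColumnName", "columnName"], case_sensitive=False))
```

With the default `case_sensitive=True`, `ColumnName` and `columnName` are left alone as two separate columns, which is the behaviour asked for above. A list produced this way could serve as the replacement header in a pre-processing step like the R workaround described earlier.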