[GitHub] [spark] izchen opened a new pull request #29862: [SPARK-32956][SQL] Ensure that the generated and existing headers are not duplicated in CSV DataSource

GitBox Thu, 24 Sep 2020 03:09:06 -0700


izchen opened a new pull request #29862:
URL: https://github.com/apache/spark/pull/29862



   ### What changes were proposed in this pull request?
   [SPARK-16896](https://issues.apache.org/jira/browse/SPARK-16896) made the 
CSV DataSource generate new column headers to replace duplicate column headers 
and empty-string column headers.
   
   This PR handles the case where a newly generated column header collides with 
an existing column header: a new column header is generated again using the 
method in SPARK-16896.
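
As a rough illustration of the intended behavior (not the code in this PR), the idea can be sketched in Python. The function name `dedupe_header` is hypothetical, and the sketch makes two simplifying assumptions: a colliding generated name gets the index suffix appended again until it is free, and empty-string headers and case sensitivity are ignored.

```python
from collections import Counter


def dedupe_header(header):
    """Hypothetical sketch: suffix duplicated names with their column index,
    and re-apply the suffix while the generated name is already taken."""
    counts = Counter(header)
    # Names that occur exactly once are kept as-is, so they are reserved up front.
    taken = {name for name in header if counts[name] == 1}
    result = []
    for index, name in enumerate(header):
        if counts[name] == 1:
            result.append(name)
            continue
        candidate = f"{name}{index}"
        # Generate again if the new name collides with an existing one.
        while candidate in taken:
            candidate = f"{candidate}{index}"
        taken.add(candidate)
        result.append(candidate)
    return result


print(dedupe_header(["a", "a", "a", "a1"]))  # ['a0', 'a11', 'a2', 'a1']
```

The exact names Spark produces after this change may differ; the point is only that every resulting column name is unique.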
   
   
   ### Why are the changes needed?
   When a CSV file has duplicate column headers, Spark generates new column 
headers from the original ones, using the column index as a suffix.
   
   When a newly generated column header collides with an existing column 
header, Spark throws an exception whose message is difficult for users to 
understand.
   
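   For example, with the CSV column header `a,a,a,a1`, the duplicated `a` 
columns are renamed `a0`, `a1`, `a2`, and the generated `a1` collides with the 
existing `a1` column: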
   
   > AnalysisException: Found duplicate column(s) in the data schema: a1
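
The collision can be reproduced outside Spark with a small Python sketch of the single renaming pass. `rename_duplicates_once` is a hypothetical name, and the sketch assumes a zero-based column index as the suffix; it only mimics the renaming rule, not Spark's actual code.

```python
from collections import Counter


def rename_duplicates_once(header):
    """Single pass: suffix every duplicated name with its column index."""
    counts = Counter(header)
    return [
        f"{name}{index}" if counts[name] > 1 else name
        for index, name in enumerate(header)
    ]


renamed = rename_duplicates_once(["a", "a", "a", "a1"])
print(renamed)  # ['a0', 'a1', 'a2', 'a1']

# The generated 'a1' now appears twice, which is what triggers the error above.
still_duplicated = [name for name, n in Counter(renamed).items() if n > 1]
print(still_duplicated)  # ['a1']
```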
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Added a unit test case


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



