[I] CSVReader behavior with dataset that has duplicate column headers is confusing [datafusion]

via GitHub Thu, 10 Oct 2024 07:30:47 -0700


alex opened a new issue, #12852:
URL: https://github.com/apache/datafusion/issues/12852


   ### Describe the bug
   
   When working with a CSV file that has duplicate column headers (e.g., https://github.com/openssl/openssl/blob/master/test/recipes/80-test_cmp_http_data/test_connection.csv), the behavior is confusing.
   
   Specifically, any query against a table backed by that file produces the error:
   
   > Arrow error: Csv error: incorrect number of fields for line 1, expected 14 got more than 14
   
   However, all rows in that CSV have 20 fields.
   
   Judging by the results of a `limit 0` query, the inferred schema effectively drops all duplicate columns, which would explain the expected count of 14. The data rows, which still have 20 fields each, therefore do not match the expected number of fields.
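
The mismatch described above can be illustrated with a small, library-free sketch. This is not DataFusion's or Arrow's actual inference code; the header and row below are hypothetical (not taken from the linked file), and the "keep one field per distinct name" step is an assumption based on what the `limit 0` query suggests.

```python
# Hypothetical header with duplicate names (not the real file's header).
header = ["id", "name", "value", "name", "value"]
row = ["1", "a", "x", "b", "y"]

# Assumption: the inferred schema keeps only one field per distinct name.
# dict.fromkeys preserves first-seen order while dropping repeats.
schema_fields = list(dict.fromkeys(header))

# The row still carries every cell, so it is wider than the schema.
print(f"header columns: {len(header)}")
print(f"schema fields:  {len(schema_fields)}")
print(f"row cells:      {len(row)}")
```

Under that assumption, the reader would expect fewer fields than each row provides, which matches the shape of the reported error.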
   
   ### To Reproduce
   
   Run queries against the linked CSV file.
   
   ### Expected behavior
   
   I believe the expected behavior would be either a) to automatically rename the duplicate columns (perhaps by appending `_{n}`), or b) to return a clear error stating that schemas with duplicate column names are not supported.
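
As a rough sketch of what options a) and b) might look like, independent of DataFusion: the function names `rename_duplicates` and `check_unique` are hypothetical, and the exact `_{n}` numbering scheme (first repeat gets `_1`, skipping suffixes that collide with existing names) is an assumption, not something the project has specified.

```python
from collections import Counter


def rename_duplicates(names):
    """Option a): keep the first occurrence, suffix later ones with _{n}."""
    seen = Counter()
    taken = set(names)  # names already in use, including the originals
    out = []
    for name in names:
        n = seen[name]
        seen[name] += 1
        if n == 0:
            out.append(name)
            continue
        candidate = f"{name}_{n}"
        # Skip suffixes that would collide with another column name.
        while candidate in taken:
            n += 1
            candidate = f"{name}_{n}"
        taken.add(candidate)
        out.append(candidate)
    return out


def check_unique(names):
    """Option b): reject the header with a clear message."""
    dupes = sorted(n for n, c in Counter(names).items() if c > 1)
    if dupes:
        raise ValueError(f"duplicate column names are not supported: {dupes}")


print(rename_duplicates(["a", "b", "a", "a"]))
```

Either approach keeps the number of schema fields equal to the number of header columns, or fails before any data rows are read.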
   
   ### Additional context
   
   _No response_


