Re: [I] CSVReader behavior with dataset that has duplicate column headers is confusing [datafusion]

via GitHub Mon, 11 Nov 2024 18:41:06 -0800


Rafferty97 commented on issue #12852:
URL: https://github.com/apache/datafusion/issues/12852#issuecomment-2469474930


   Having thought about it some more, I think the use of `Schema::try_merge` is 
actually incorrect for CSV files, because the CSV reading process assumes that 
the fields in the `Schema` are in the same order as they appear in the file.
   
   So, if two CSV files have the same columns in a different order, merging 
their schemas by name will cause data to land in the wrong columns. This might 
error out if the types do not match, but it could also silently return bogus data.
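
   To make the failure mode concrete, here is a minimal std-only Rust sketch. 
It does not call arrow's `Schema::try_merge`; `merge_by_name` and 
`read_row_positionally` are hypothetical stand-ins that only model the two 
behaviours described above (merge by name, decode by position).

```rust
// Simplified stand-in for a name-based schema merge (an assumption, not the
// arrow implementation): keep the order in which names are first seen and
// append any name that has not appeared yet.
fn merge_by_name<'a>(schemas: &[Vec<&'a str>]) -> Vec<&'a str> {
    let mut merged: Vec<&'a str> = Vec::new();
    for schema in schemas {
        for &name in schema {
            if !merged.contains(&name) {
                merged.push(name);
            }
        }
    }
    merged
}

// Positional decoding: the i-th value in a line is labelled with the i-th
// field of the schema, regardless of what the file's own header said.
fn read_row_positionally<'a>(schema: &[&'a str], line: &str) -> Vec<(&'a str, String)> {
    schema
        .iter()
        .copied()
        .zip(line.split(',').map(|value| value.trim().to_string()))
        .collect()
}
```

   With one file headed `id,name` and another headed `name,id`, the merged 
schema in this model is `id, name`, so a row `alice,1` from the second file is 
decoded with `alice` under `id` and `1` under `name`.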
   
   My intuition is that the code needs to be changed to merge CSV schemas by 
field index rather than by field name.
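
   A rough sketch of what an index-based merge could look like, again std-only 
and not DataFusion code. `ColType`, `Field`, `widen` and `merge_by_index` are 
hypothetical names, and both the widening rule and the choice to reject 
mismatched headers at the same position are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Int,
    Float,
    Text,
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    ty: ColType,
}

// Assumed widening rule: equal types stay as they are, Int and Float widen
// to Float, anything else falls back to Text.
fn widen(a: ColType, b: ColType) -> ColType {
    match (a, b) {
        (x, y) if x == y => x,
        (ColType::Int, ColType::Float) | (ColType::Float, ColType::Int) => ColType::Float,
        _ => ColType::Text,
    }
}

// Merge schemas position by position. A header that differs at the same
// index is reported as an error instead of being matched up by name.
fn merge_by_index(schemas: &[Vec<Field>]) -> Result<Vec<Field>, String> {
    let mut merged: Vec<Field> = Vec::new();
    for schema in schemas {
        for (i, field) in schema.iter().enumerate() {
            if i < merged.len() {
                if merged[i].name != field.name {
                    return Err(format!(
                        "column {}: header `{}` does not match `{}`",
                        i, field.name, merged[i].name
                    ));
                }
                merged[i].ty = widen(merged[i].ty, field.ty);
            } else {
                // A file with more columns extends the merged schema.
                merged.push(field.clone());
            }
        }
    }
    Ok(merged)
}
```

   Under these assumptions, two files with the same columns in a different 
order produce an error up front rather than misaligned data later.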


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

