westonpace commented on PR #14663: URL: https://github.com/apache/arrow/pull/14663#issuecomment-1322608047
> It's a little bit unclear to me what happens type-wise when csv column names are duplicated and types of duplicated columns don't match. Do we currently raise or promote?

This should be fully configurable by the schema evolution strategy.

 * The user will ask for a column in the dataset schema and we will know the name of that column.
 * The CSV reader will supply its list of column names (which will contain the duplicate).
 * The evolution strategy will be given that list (the inspected fragment), the dataset schema, and the requested columns in the dataset schema, and must tell us which columns to load from the CSV.

I think the default behavior picks one arbitrarily and tries to load it. If the type for that column in the file doesn't match the type for the requested column in the dataset schema, then a runtime error should be generated.

Happy to change the default to simply give an error when the fragment has duplicate column names.

At this point in time I don't think we actually infer the types, but we could, so that we could support a "find first column with the right name and a compatible type" strategy.

In most cases, if the fragments have duplicate column names, any amount of guessing on our part isn't likely to be correct. Sounds like a good test case to have, though, to verify whatever it is that we do.
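To make the resolution step concrete, here is a rough sketch of the decision the evolution strategy has to make: given the fragment's column names and the requested dataset columns, decide which fragment column to load for each request. This is plain Python for illustration only, not Arrow's actual API. `resolve_columns`, `DuplicateColumnError`, and the `on_duplicate` parameter are hypothetical names, and "first" stands in for the "pick one arbitrarily" default (which I am only guessing at above).

```python
from collections import defaultdict


class DuplicateColumnError(ValueError):
    """Hypothetical error: a requested name appears more than once in a fragment."""


def resolve_columns(fragment_names, requested_names, on_duplicate="first"):
    """Map each requested dataset column to a column index in the fragment.

    fragment_names: column names reported by the CSV reader (may repeat).
    requested_names: names of the dataset-schema columns the user asked for.
    on_duplicate: "first" loads the first matching column; "error" rejects
        any requested name that is duplicated in the fragment.
    """
    # Record every position at which each name occurs in the fragment.
    positions = defaultdict(list)
    for index, name in enumerate(fragment_names):
        positions[name].append(index)

    resolved = {}
    for name in requested_names:
        matches = positions.get(name, [])
        if not matches:
            # Column is absent from this fragment; the caller decides what to do.
            resolved[name] = None
        elif len(matches) > 1 and on_duplicate == "error":
            raise DuplicateColumnError(
                f"column {name!r} appears {len(matches)} times in the fragment"
            )
        else:
            resolved[name] = matches[0]
    return resolved


if __name__ == "__main__":
    print(resolve_columns(["a", "b", "a"], ["a", "b"]))
```

Type checking is deliberately left out: as noted above, the fragment's types are probably not inferred at this point, so a type mismatch would only surface later as a runtime error when the chosen column is actually loaded.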