westonpace commented on PR #14663:
URL: https://github.com/apache/arrow/pull/14663#issuecomment-1322608047

   > It's a little bit unclear to me what happens type-wise when csv column 
names are duplicated and types of duplicated columns don't match. Do we 
currently raise or promote?
   
   This should be fully configurable by the schema evolution strategy.
   
   * The user will ask for a column in the dataset schema and we will know the 
name of that column
   * The CSV reader will supply its list of column names (which will contain 
the duplicate)
   * The evolution strategy will be given that list (the inspected fragment), 
the dataset schema, and the requested columns in the dataset schema and must 
tell us which columns to load from the CSV.
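
   As a rough illustration of that flow, here is a minimal sketch in plain Python. It is not the Arrow C++ API: `first_match_strategy`, its arguments, and the name-to-index return value are hypothetical, and taking the leftmost match is only one possible way of picking a duplicate.

```python
from typing import Dict, Sequence


def first_match_strategy(
    fragment_columns: Sequence[str],
    dataset_schema: Sequence[str],
    requested: Sequence[str],
) -> Dict[str, int]:
    """Map each requested dataset column to a column index in the fragment.

    Hypothetical stand-in for an evolution strategy: it receives the column
    names reported by the CSV reader, the dataset schema, and the requested
    columns, and decides which fragment columns to load.
    """
    columns = list(fragment_columns)
    selection = {}
    for name in requested:
        if name not in dataset_schema:
            raise KeyError(f"{name!r} is not in the dataset schema")
        # list.index returns the first occurrence, so a duplicated name
        # resolves to the leftmost column in the file.
        if name in columns:
            selection[name] = columns.index(name)
        # Columns missing from the fragment are simply skipped in this sketch.
    return selection


fragment_columns = ["id", "value", "value"]  # CSV header with a duplicate
dataset_schema = ["id", "value"]
print(first_match_strategy(fragment_columns, dataset_schema, ["value"]))
```

   Type checking is deliberately left out here; in this sketch a mismatch would only surface later, when the selected column is actually read.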
   
   I think the default behavior picks one arbitrarily and tries to load it.  If 
the type for that column in the file doesn't match the type for the requested 
column in the dataset schema then a runtime error should be generated.
   
   Happy to change the default to simply give an error when the fragment has 
duplicate column names.  At this point in time I don't think we actually infer 
the types, but we could, which would let us support a "find the first column 
with the right name and a compatible type" strategy.
   In most cases, if the fragments have duplicate column names, no amount of 
guessing on our part is likely to be correct.
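
   The two alternatives to the current default could look roughly like this. Again a plain-Python sketch with hypothetical names (`reject_duplicates`, `first_compatible`); types are represented as strings and "compatible" defaults to exact equality, which is an assumption, not how Arrow defines type compatibility.

```python
from collections import Counter


def reject_duplicates(fragment_fields):
    """Raise if any column name appears more than once in the fragment."""
    counts = Counter(name for name, _ in fragment_fields)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"fragment has duplicate column names: {duplicates}")


def first_compatible(fragment_fields, name, wanted_type,
                     compatible=lambda have, want: have == want):
    """Return the index of the first column with `name` and a compatible type.

    This requires the fragment's column types to be known (inferred) up front.
    """
    for index, (field_name, field_type) in enumerate(fragment_fields):
        if field_name == name and compatible(field_type, wanted_type):
            return index
    raise TypeError(
        f"no column named {name!r} with a type compatible with {wanted_type}"
    )


# (name, type) pairs for a fragment whose header repeats "value".
fields = [("id", "int64"), ("value", "string"), ("value", "double")]
print(first_compatible(fields, "value", "double"))
```

   With the same `fields`, `reject_duplicates(fields)` raises a `ValueError` instead of guessing.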
   
   Sounds like a good test case to have, though, to verify whatever it is that 
we do.


