Re: [PR] feat: Support reading CSV files with inconsistent column counts [datafusion]

via GitHub Thu, 18 Sep 2025 14:34:28 -0700


EeshanBembi commented on code in PR #17553:
URL: https://github.com/apache/datafusion/pull/17553#discussion_r2361243741



##########
datafusion/datasource-csv/src/file_format.rs:
##########
@@ -560,21 +573,28 @@ impl CsvFormat {
                     })
                     .unzip();
             } else {
-                if fields.len() != column_type_possibilities.len() {
-                    return exec_err!(
-                            "Encountered unequal lengths between records on CSV file whilst inferring schema. \
-                             Expected {} fields, found {} fields at record {}",
-                            column_type_possibilities.len(),
-                            fields.len(),
-                            record_number + 1
-                        );
+                // Handle files with different numbers of columns by extending the schema
+                if fields.len() > column_type_possibilities.len() {
+                    // New columns found - extend our tracking structures
+                    for field in fields.iter().skip(column_type_possibilities.len()) {
+                        column_names.push(field.name().clone());
+                        let mut possibilities = HashSet::new();
+                        if records_read > 0 {
+                            possibilities.insert(field.data_type().clone());
+                        }
+                        column_type_possibilities.push(possibilities);
+                    }
+                }

Review Comment:
   The current implementation performs a positional union, not a union by name.
Files must list their columns in the same order, and later files may add new
columns at the end. This is consistent with the CSV format, which has no
inherent column naming: column names come from the header row and are matched
by position. Union by name would require a different approach and is not
implemented in this PR.
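
To illustrate what positional union means here, the following is a minimal standalone sketch, not the PR's code. `positional_union` is a hypothetical helper that works on plain header rows instead of DataFusion's schema types. It assumes column `i` in every file is the same logical column, so only trailing columns can be new.

```rust
/// Hypothetical helper (not part of DataFusion): merges per-file header rows
/// by position. Column `i` of every file is assumed to be the same logical
/// column, so the name first seen at a position is the one that is kept.
fn positional_union(headers_per_file: &[Vec<String>]) -> Vec<String> {
    let mut merged: Vec<String> = Vec::new();
    for headers in headers_per_file {
        // Number of column positions already known from earlier files.
        let known = merged.len();
        // Only columns beyond the current width count as new; they are
        // appended at the end, mirroring the `skip(...)` in the diff above.
        if headers.len() > known {
            merged.extend(headers.iter().skip(known).cloned());
        }
    }
    merged
}
```

With headers `[a, b]` and `[a, b, c]` this yields `[a, b, c]`. With `[a, b]` and `[b, a, c]` it also yields `[a, b, c]`, because names at already-known positions are never compared. That is the limitation described above: reordered columns are not detected, and a union by name would be needed to handle them.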



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

