EeshanBembi commented on code in PR #17553:
URL: https://github.com/apache/datafusion/pull/17553#discussion_r2361243741
##########
datafusion/datasource-csv/src/file_format.rs:
##########
@@ -560,21 +573,28 @@ impl CsvFormat {
})
.unzip();
} else {
- if fields.len() != column_type_possibilities.len() {
- return exec_err!(
- "Encountered unequal lengths between records on CSV file whilst inferring schema. \
- Expected {} fields, found {} fields at record {}",
- column_type_possibilities.len(),
- fields.len(),
- record_number + 1
- );
+ // Handle files with different numbers of columns by extending the schema
+ if fields.len() > column_type_possibilities.len() {
+ // New columns found - extend our tracking structures
+ for field in fields.iter().skip(column_type_possibilities.len()) {
+ column_names.push(field.name().clone());
+ let mut possibilities = HashSet::new();
+ if records_read > 0 {
+ possibilities.insert(field.data_type().clone());
+ }
+ column_type_possibilities.push(possibilities);
+ }
+ }
Review Comment:
The current implementation performs a positional union, not a union by name.
Files must have their columns in the same order, and later files may add new
columns at the end. This is consistent with the CSV format, which has no
inherent column naming: column names come from the header row and are
assigned by position. Union by name would require a different approach and is
not implemented in this PR.
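
To make the positional behaviour concrete, here is a minimal standalone sketch. It is not the code in this PR: `ColType`, `merge_positional`, and `demo` are hypothetical names, and `ColType` is only a stand-in for Arrow's `DataType`. It shows that a second file with reordered columns is merged slot by slot, so the names from the first file win and only the trailing extra column extends the schema.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for Arrow's `DataType` (assumption, not DataFusion's API).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum ColType {
    Int64,
    Utf8,
}

/// Merge one file's columns into the running schema purely by position.
/// Column `i` of the file always lands in slot `i`, whatever its name is.
fn merge_positional(
    names: &mut Vec<String>,
    possibilities: &mut Vec<HashSet<ColType>>,
    file_columns: &[(&str, ColType)],
) {
    for (i, (name, ty)) in file_columns.iter().enumerate() {
        if i >= possibilities.len() {
            // A position we have not seen before: extend the tracking structures.
            names.push(name.to_string());
            possibilities.push(HashSet::new());
        }
        // Record the observed type for this position.
        possibilities[i].insert(ty.clone());
    }
}

/// Merge two hypothetical files: the second reorders `a` and `b` and appends `c`.
fn demo() -> (Vec<String>, Vec<HashSet<ColType>>) {
    let mut names = Vec::new();
    let mut possibilities = Vec::new();
    merge_positional(
        &mut names,
        &mut possibilities,
        &[("a", ColType::Int64), ("b", ColType::Utf8)],
    );
    merge_positional(
        &mut names,
        &mut possibilities,
        &[("b", ColType::Utf8), ("a", ColType::Int64), ("c", ColType::Int64)],
    );
    (names, possibilities)
}
```

In this sketch the merged names are `a`, `b`, `c`, and the first two slots each end up with both `Int64` and `Utf8` as candidates, because the reordered header in the second file is ignored. A union by name would instead need a lookup from column name to slot.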