tustvold commented on code in PR #4818:
URL: https://github.com/apache/arrow-datafusion/pull/4818#discussion_r1070588199
##########
datafusion/core/src/physical_plan/file_format/csv.rs:
##########
@@ -444,24 +444,11 @@ mod tests {
assert_eq!(14, csv.schema().fields().len());
let mut it = csv.execute(0, task_ctx)?;
- let batch = it.next().await.unwrap()?;
- assert_eq!(14, batch.num_columns());
- assert_eq!(5, batch.num_rows());
-
- let expected = vec![
-
"+----+----+-----+--------+------------+----------------------+-----+-------+------------+----------------------+-------------+---------------------+--------------------------------+-------------+",
- "| c1 | c2 | c3 | c4 | c5 | c6 | c7
| c8 | c9 | c10 | c11 | c12
| c13 | missing_col |",
-
"+----+----+-----+--------+------------+----------------------+-----+-------+------------+----------------------+-------------+---------------------+--------------------------------+-------------+",
- "| c | 2 | 1 | 18109 | 2033001162 | -6513304855495910254 | 25
| 43062 | 1491205016 | 5863949479783605708 | 0.110830784 | 0.9294097332465232
| 6WfVFBVGJSQb7FhA7E0lBwdvjfZnSW | |",
- "| d | 5 | -40 | 22614 | 706441268 | -7542719935673075327 |
155 | 14337 | 3373581039 | 11720144131976083864 | 0.69632107 |
0.3114712539863804 | C2GT5KVyOPZpgKVl110TyZO0NcJ434 | |",
- "| b | 1 | 29 | -18218 | 994303988 | 5983957848665088916 |
204 | 9489 | 3275293996 | 14857091259186476033 | 0.53840446 |
0.17909035118828576 | AyYVExXK6AR2qUTxNZ7qRHQOVGMLcz | |",
- "| a | 1 | -85 | -15154 | 1171968280 | 1919439543497968449 | 77
| 52286 | 774637006 | 12101411955859039553 | 0.12285209 | 0.6864391962767343
| 0keZ5G8BffGwgF2RwQD59TFzMStxCB | |",
- "| b | 5 | -82 | 22080 | 1824882165 | 7373730676428214987 |
208 | 34331 | 3342719438 | 3330177516592499461 | 0.82634634 |
0.40975383525297016 | Ig1QcuKsjHXkproePdERo2w0mYzIqd | |",
-
"+----+----+-----+--------+------------+----------------------+-----+-------+------------+----------------------+-------------+---------------------+--------------------------------+-------------+",
- ];
-
- crate::assert_batches_eq!(expected, &[batch]);
-
+ let err = it.next().await.unwrap().unwrap_err().to_string();
Review Comment:
Yeah, that logic has never been correct, it incorrectly assumes that the CSV
reader will pad nulls which it hasn't ever except for in the case of a prefix.
Unlike parquet or JSON columns aren't named, and so there isn't a way to
perform this.
Consider a schema containing columns a, b, and c, if the file actually had
schema a,c previously this would interpret column c as b 😱. Now it errors as it
should.
I would advise we merge this as is, and potentially file a follow up ticket
to investigate this further. I think the mistake is in this test, but I'm not
100% sure
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]