tustvold commented on code in PR #4818:
URL: https://github.com/apache/arrow-datafusion/pull/4818#discussion_r1070588199


##########
datafusion/core/src/physical_plan/file_format/csv.rs:
##########
@@ -444,24 +444,11 @@ mod tests {
         assert_eq!(14, csv.schema().fields().len());
 
         let mut it = csv.execute(0, task_ctx)?;
-        let batch = it.next().await.unwrap()?;
-        assert_eq!(14, batch.num_columns());
-        assert_eq!(5, batch.num_rows());
-
-        let expected = vec![
-            
"+----+----+-----+--------+------------+----------------------+-----+-------+------------+----------------------+-------------+---------------------+--------------------------------+-------------+",
-            "| c1 | c2 | c3  | c4     | c5         | c6                   | c7 
 | c8    | c9         | c10                  | c11         | c12                
 | c13                            | missing_col |",
-            
"+----+----+-----+--------+------------+----------------------+-----+-------+------------+----------------------+-------------+---------------------+--------------------------------+-------------+",
-            "| c  | 2  | 1   | 18109  | 2033001162 | -6513304855495910254 | 25 
 | 43062 | 1491205016 | 5863949479783605708  | 0.110830784 | 0.9294097332465232 
 | 6WfVFBVGJSQb7FhA7E0lBwdvjfZnSW |             |",
-            "| d  | 5  | -40 | 22614  | 706441268  | -7542719935673075327 | 
155 | 14337 | 3373581039 | 11720144131976083864 | 0.69632107  | 
0.3114712539863804  | C2GT5KVyOPZpgKVl110TyZO0NcJ434 |             |",
-            "| b  | 1  | 29  | -18218 | 994303988  | 5983957848665088916  | 
204 | 9489  | 3275293996 | 14857091259186476033 | 0.53840446  | 
0.17909035118828576 | AyYVExXK6AR2qUTxNZ7qRHQOVGMLcz |             |",
-            "| a  | 1  | -85 | -15154 | 1171968280 | 1919439543497968449  | 77 
 | 52286 | 774637006  | 12101411955859039553 | 0.12285209  | 0.6864391962767343 
 | 0keZ5G8BffGwgF2RwQD59TFzMStxCB |             |",
-            "| b  | 5  | -82 | 22080  | 1824882165 | 7373730676428214987  | 
208 | 34331 | 3342719438 | 3330177516592499461  | 0.82634634  | 
0.40975383525297016 | Ig1QcuKsjHXkproePdERo2w0mYzIqd |             |",
-            
"+----+----+-----+--------+------------+----------------------+-----+-------+------------+----------------------+-------------+---------------------+--------------------------------+-------------+",
-        ];
-
-        crate::assert_batches_eq!(expected, &[batch]);
-
+        let err = it.next().await.unwrap().unwrap_err().to_string();

Review Comment:
   Yeah, that logic has never been correct, it incorrectly assumes that the CSV 
reader will pad nulls which it hasn't ever except for in the case of a prefix. 
Unlike parquet or JSON columns aren't named, and so there isn't a way to 
perform this.
   
   Consider a schema containing columns a, b, and c, if the file actually had 
schema a,c previously this would interpret column c as b 😱. Now it errors as it 
should.
   
   I would advise we merge this as is, and potentially file a follow up ticket 
to investigate this further. I think the mistake is in this test, but I'm not 
100% sure



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to