houqp commented on a change in pull request #7210:
URL: https://github.com/apache/arrow/pull/7210#discussion_r427605427
##########
File path: rust/datafusion/src/execution/physical_plan/csv.rs
##########
@@ -71,15 +75,35 @@ impl CsvExec {
     /// Create a new execution plan for reading a set of CSV files
     pub fn try_new(
         path: &str,
-        schema: Arc<Schema>,
+        schema: Option<Arc<Schema>>,
         has_header: bool,
+        delimiter: Option<u8>,
         projection: Option<Vec<usize>>,
         batch_size: usize,
     ) -> Result<Self> {
+        let schema = match schema {
+            Some(s) => s,
+            None => {
+                let mut filenames: Vec<String> = vec![];
+                common::build_file_list(path, &mut filenames, ".csv")?;
+                if filenames.is_empty() {
+                    return Err(ExecutionError::General("No files found".to_string()));
+                }
+
+                let f = File::open(&filenames[0])?;

Review comment:
   Yeah, there is no guarantee no matter what we do, unless we read all of the entries. Even with max_inference, we can't guarantee that the remaining rows will conform to the inferred schema. The way I look at it: manually specify a schema if you want correctness and performance; only use schema inference if you just want to get a quick and dirty query up and running.

   That said, I will try to change it to read all of the files instead of just the first one.
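   For illustration, a rough sketch of the "read all files" approach might look like the snippet below. It assumes arrow's `csv::reader::infer_file_schema` takes a reader, delimiter, `max_read_records`, and `has_header` and returns the inferred schema plus a record count, and it reuses the file list produced by `common::build_file_list`; the helper name and exact signatures are assumptions, not the final implementation in this PR.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::csv::reader::infer_file_schema;
use arrow::datatypes::Schema;

use crate::error::{ExecutionError, Result};

/// Sketch: infer one schema per CSV file and require that they all agree,
/// instead of trusting only the first file. The arrow helper's signature is
/// an assumption here.
fn infer_schema_from_all_files(
    filenames: &[String],
    delimiter: u8,
    has_header: bool,
    max_records_per_file: Option<usize>,
) -> Result<Arc<Schema>> {
    let mut schemas: Vec<Schema> = Vec::with_capacity(filenames.len());
    for filename in filenames {
        let mut file = File::open(filename)?;
        // Assumed arrow API: (reader, delimiter, max_read_records, has_header),
        // returning the inferred schema and the number of records read.
        let (schema, _records_read) =
            infer_file_schema(&mut file, delimiter, max_records_per_file, has_header)
                .map_err(|e| ExecutionError::General(format!("{:?}", e)))?;
        schemas.push(schema);
    }

    let first = schemas
        .first()
        .cloned()
        .ok_or_else(|| ExecutionError::General("No files found".to_string()))?;

    // Stricter than "trust the first file": any disagreement between files
    // becomes an error up front instead of a surprise at scan time.
    for (schema, filename) in schemas.iter().zip(filenames) {
        if schema != &first {
            return Err(ExecutionError::General(format!(
                "schema inferred from {} does not match the first file",
                filename
            )));
        }
    }

    Ok(Arc::new(first))
}
```

   An alternative would be merging compatible schemas (e.g. widening numeric columns) rather than requiring strict equality, but that trades away the correctness guarantee for convenience.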