houqp commented on a change in pull request #7210:
URL: https://github.com/apache/arrow/pull/7210#discussion_r427605427
##########
File path: rust/datafusion/src/execution/physical_plan/csv.rs
##########
@@ -71,15 +75,35 @@ impl CsvExec {
/// Create a new execution plan for reading a set of CSV files
pub fn try_new(
path: &str,
- schema: Arc<Schema>,
+ schema: Option<Arc<Schema>>,
has_header: bool,
+ delimiter: Option<u8>,
projection: Option<Vec<usize>>,
batch_size: usize,
) -> Result<Self> {
+ let schema = match schema {
+ Some(s) => s,
+ None => {
+ let mut filenames: Vec<String> = vec![];
+ common::build_file_list(path, &mut filenames, ".csv")?;
+ if filenames.is_empty() {
+ return Err(ExecutionError::General("No files found".to_string()));
+ }
+
+ let f = File::open(&filenames[0])?;
Review comment:
Yeah, nothing we do can guarantee correctness unless we read every row. Even
with max_inference, we can't guarantee that the remaining rows will conform to
the inferred schema.
The way I look at it: specify a schema manually if you want correctness and
performance, and use schema inference only when you just want to get a quick
and dirty query up and running.
That said, I will try to change it to read all files instead of only the first
one.
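To make the trade-off concrete, here is a rough std-only sketch of inferring one schema across every file instead of just the first. It is not the Arrow or DataFusion API: `ColType`, `infer_field`, `infer_schema_all`, and `max_records` are hypothetical names for illustration. The field splitting is deliberately naive (no quoted fields), and each file's contents is passed in as a string.

```rust
/// Hypothetical type lattice. The declaration order doubles as the widening
/// order used by the derived `Ord`: Null < Int64 < Float64 < Utf8.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ColType {
    Null, // only empty values seen so far
    Int64,
    Float64,
    Utf8,
}

/// Guess the narrowest type that can hold a single field.
fn infer_field(field: &str) -> ColType {
    let f = field.trim();
    if f.is_empty() {
        ColType::Null
    } else if f.parse::<i64>().is_ok() {
        ColType::Int64
    } else if f.parse::<f64>().is_ok() {
        ColType::Float64
    } else {
        ColType::Utf8
    }
}

/// Infer one schema over the contents of several CSV files, looking at no
/// more than `max_records` data rows in total (`None` means read everything).
/// Assumption: fields never contain the delimiter inside quotes.
fn infer_schema_all(
    files: &[&str],
    delimiter: char,
    has_header: bool,
    max_records: Option<usize>,
) -> Vec<ColType> {
    let mut schema: Vec<ColType> = Vec::new();
    let mut seen = 0usize;
    for contents in files {
        let mut lines = contents.lines();
        if has_header {
            lines.next(); // skip the header row of each file
        }
        for line in lines {
            if max_records.map_or(false, |m| seen >= m) {
                // Row budget exhausted: later rows are never checked.
                return schema;
            }
            for (i, field) in line.split(delimiter).enumerate() {
                let t = infer_field(field);
                if i >= schema.len() {
                    schema.push(t);
                } else {
                    // Widen the column to the broader of the two types.
                    schema[i] = schema[i].max(t);
                }
            }
            seen += 1;
        }
    }
    schema
}
```

Any row beyond the budget can still contradict the result, which is the case where an explicit schema is the only safe option.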