[GitHub] [arrow] houqp commented on a change in pull request #7210: ARROW-8839: [Rust] [DataFusion] support CSV schema inference in logical plan

GitBox Tue, 19 May 2020 14:17:40 -0700


houqp commented on a change in pull request #7210:
URL: https://github.com/apache/arrow/pull/7210#discussion_r427605427




##########
File path: rust/datafusion/src/execution/physical_plan/csv.rs
##########
@@ -71,15 +75,35 @@ impl CsvExec {
     /// Create a new execution plan for reading a set of CSV files
     pub fn try_new(
         path: &str,
-        schema: Arc<Schema>,
+        schema: Option<Arc<Schema>>,
         has_header: bool,
+        delimiter: Option<u8>,
         projection: Option<Vec<usize>>,
         batch_size: usize,
     ) -> Result<Self> {
+        let schema = match schema {
+            Some(s) => s,
+            None => {
+                let mut filenames: Vec<String> = vec![];
+                common::build_file_list(path, &mut filenames, ".csv")?;
+                if filenames.is_empty() {
+                    return Err(ExecutionError::General("No files found".to_string()));
+                }
+
+                let f = File::open(&filenames[0])?;

Review comment:
       Yeah, there is no guarantee no matter what we do unless we read all the 
entries. Even with max_inference, we can't guarantee that the remaining rows 
will conform to the inferred schema.
   
   The way I look at this: manually specify a schema if you want correctness 
and performance. Only use schema inference if you just want to get a quick and 
dirty query up and running.
   
   That said, I will try to change it to read all files instead of only the 
first one.
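   
   To make the trade-off concrete, here is a minimal std-only sketch of 
bounded inference. It is not the Arrow CSV reader's implementation; `ColType`, 
`infer_value`, `widen`, `infer_schema` and `conforms` are hypothetical names 
made up for illustration, and the `max_rows` parameter only stands in for the 
max_inference limit mentioned above.

```rust
// Illustrative sketch only: hypothetical types and helpers, not the
// Arrow/DataFusion API.

#[derive(Debug, Clone, Copy, PartialEq)]
enum ColType {
    Int,
    Float,
    Text,
}

/// Guess the narrowest type that can hold a single field.
fn infer_value(field: &str) -> ColType {
    if field.parse::<i64>().is_ok() {
        ColType::Int
    } else if field.parse::<f64>().is_ok() {
        ColType::Float
    } else {
        ColType::Text
    }
}

/// Merge two guesses into the wider type (Int < Float < Text).
fn widen(a: ColType, b: ColType) -> ColType {
    use ColType::*;
    match (a, b) {
        (Text, _) | (_, Text) => Text,
        (Float, _) | (_, Float) => Float,
        _ => Int,
    }
}

/// Infer one type per column from at most `max_rows` rows
/// (`None` means scan every row).
fn infer_schema(rows: &[&str], delimiter: char, max_rows: Option<usize>) -> Vec<ColType> {
    let limit = max_rows.unwrap_or(rows.len());
    let mut schema: Vec<ColType> = Vec::new();
    for row in rows.iter().take(limit) {
        for (i, field) in row.split(delimiter).enumerate() {
            let guess = infer_value(field.trim());
            if i < schema.len() {
                schema[i] = widen(schema[i], guess);
            } else {
                schema.push(guess);
            }
        }
    }
    schema
}

/// Check whether a row fits a previously inferred schema without widening it.
fn conforms(row: &str, delimiter: char, schema: &[ColType]) -> bool {
    let fields: Vec<&str> = row.split(delimiter).collect();
    fields.len() == schema.len()
        && fields
            .iter()
            .zip(schema)
            .all(|(f, &t)| widen(infer_value(f.trim()), t) == t)
}
```

   With the rows `"1,2.5"`, `"2,3"` and `"x,4"`, inferring from only the first 
two rows gives `[Int, Float]`, and the third row fails `conforms`. Scanning 
all three rows widens the first column to `Text`. The same `widen` step could 
be used to merge per-file guesses when inferring across several files.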




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
