nevi-me commented on a change in pull request #7210:
URL: https://github.com/apache/arrow/pull/7210#discussion_r427600229



##########
File path: rust/datafusion/src/execution/physical_plan/csv.rs
##########
@@ -71,15 +75,35 @@ impl CsvExec {
     /// Create a new execution plan for reading a set of CSV files
     pub fn try_new(
         path: &str,
-        schema: Arc<Schema>,
+        schema: Option<Arc<Schema>>,
         has_header: bool,
+        delimiter: Option<u8>,
         projection: Option<Vec<usize>>,
         batch_size: usize,
     ) -> Result<Self> {
+        let schema = match schema {
+            Some(s) => s,
+            None => {
+                let mut filenames: Vec<String> = vec![];
+                common::build_file_list(path, &mut filenames, ".csv")?;
+                if filenames.is_empty() {
+                    return Err(ExecutionError::General("No files found".to_string()));
+                }
+
+                let f = File::open(&filenames[0])?;

Review comment:
       What would happen if the first file is empty, but other files aren't? A 
good follow-up could be to iterate through the files, accumulating records 
until an `Option<max_inference: usize>` limit is reached, so that the desired 
sample size is actually met.
   
   I've personally encountered some CSV files where the first k records 
conformed to a schema, but things broke down thereafter, especially with files 
that aren't generated by a uniformity-preserving tool like Spark or a database.
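   
   A minimal sketch of that iteration, assuming `infer_file_schema` returns 
the inferred schema together with the number of records it actually read, and 
using a hypothetical `merge_schemas` helper to reconcile the per-file results 
(neither assumption is necessarily this PR's API):

```rust
use std::fs::File;
use std::io::BufReader;
use arrow::datatypes::Schema;

/// Hypothetical sketch: sample successive files until the requested number
/// of records has been inferred, skipping empty files along the way.
/// `Result` stands in for DataFusion's `crate::error::Result`.
fn infer_schema_from_files(
    filenames: &[String],
    delimiter: u8,
    max_inference: Option<usize>,
    has_header: bool,
) -> Result<Schema> {
    let mut schemas: Vec<Schema> = vec![];
    let mut records_to_read = max_inference.unwrap_or(std::usize::MAX);

    for fname in filenames {
        let f = File::open(fname)?;
        // Assumed to return (schema, number_of_records_read).
        let (schema, records_read) = csv::infer_file_schema(
            &mut BufReader::new(f),
            delimiter,
            Some(records_to_read),
            has_header,
        )?;
        if records_read == 0 {
            continue; // empty file: move on to the next one
        }
        schemas.push(schema);
        records_to_read -= records_read;
        if records_to_read == 0 {
            break; // desired sample size reached
        }
    }

    // Placeholder for whatever conflict-resolution policy is chosen when
    // the per-file schemas disagree.
    merge_schemas(&schemas)
}
```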

##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -762,10 +778,20 @@ mod tests {
         let mut csv = builder.build(file).unwrap();
         match csv.next() {
             Err(e) => assert_eq!(
-                "ParseError(\"Error while parsing value 4.x4 at line 4\")",
+                "ParseError(\"Error while parsing value 4.x4 for column 1 at 
line 4\")",
                 format!("{:?}", e)
             ),
             Ok(_) => panic!("should have failed"),
         }
     }
+
+    #[test]
+    fn test_infer_field_schema() {

Review comment:
       Thanks for adding this :)

##########
File path: rust/datafusion/src/execution/physical_plan/csv.rs
##########
@@ -71,15 +75,35 @@ impl CsvExec {
     /// Create a new execution plan for reading a set of CSV files
     pub fn try_new(
         path: &str,
-        schema: Arc<Schema>,
+        schema: Option<Arc<Schema>>,
         has_header: bool,
+        delimiter: Option<u8>,
         projection: Option<Vec<usize>>,
         batch_size: usize,
     ) -> Result<Self> {
+        let schema = match schema {
+            Some(s) => s,
+            None => {
+                let mut filenames: Vec<String> = vec![];
+                common::build_file_list(path, &mut filenames, ".csv")?;
+                if filenames.is_empty() {
+                    return Err(ExecutionError::General("No files found".to_string()));
+                }
+
+                let f = File::open(&filenames[0])?;
+                Arc::new(csv::infer_file_schema(
+                    &mut BufReader::new(f),
+                    delimiter.unwrap_or(b','),
+                    Some(30),

Review comment:
       Using a sample size that's too small for inference could cause issues 
for some users (especially as it's not yet configurable). How about 1000?
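   
   For what it's worth, one way to avoid the hard-coded `Some(30)` later 
would be to thread the limit through as an option with a larger default. A 
rough sketch, where `CsvReadOptions` and `max_infer_records` are illustrative 
names rather than this PR's API:

```rust
/// Hypothetical options struct for CSV scans; not part of this PR.
pub struct CsvReadOptions {
    /// Maximum number of records to sample for schema inference;
    /// `None` means read the whole file.
    pub max_infer_records: Option<usize>,
}

impl Default for CsvReadOptions {
    fn default() -> Self {
        // 1000 keeps inference robust for typical files while still
        // bounding the work done on large inputs.
        CsvReadOptions {
            max_infer_records: Some(1000),
        }
    }
}
```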

##########
File path: rust/arrow/src/csv/reader.rs
##########
@@ -87,19 +87,19 @@ fn infer_field_schema(string: &str) -> DataType {
 /// with `max_read_records` controlling the maximum number of records to read.
 ///
 /// If `max_read_records` is not set, the whole file is read to infer its schema.
-fn infer_file_schema<R: Read + Seek>(
+pub fn infer_file_schema<R: Read + Seek>(
     reader: &mut BufReader<R>,
     delimiter: u8,
     max_read_records: Option<usize>,
-    has_headers: bool,

Review comment:
       Should we change `has_headers` to `has_header` consistently on the rest 
of the `csv` submodule?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

