alamb commented on issue #2109:
URL:
https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1083477675
I think a change in the DF 7.0 line made the number of lines used to infer
the schema configurable, and the default changed to "the whole file".
Thus, in 7.0 the datafusion-cli appears to be parsing the entire CSV file
just to do schema inference.
When I applied the following diff, the time went from **131.012 seconds**
locally to **0.076 seconds**.
```diff
diff --git a/datafusion/core/src/datasource/file_format/csv.rs b/datafusion/core/src/datasource/file_format/csv.rs
index 29ca84a12..c0a6307e8 100644
--- a/datafusion/core/src/datasource/file_format/csv.rs
+++ b/datafusion/core/src/datasource/file_format/csv.rs
@@ -95,7 +95,7 @@ impl FileFormat for CsvFormat {
     async fn infer_schema(&self, mut readers: ObjectReaderStream) -> Result<SchemaRef> {
         let mut schemas = vec![];
-        let mut records_to_read = self.schema_infer_max_rec.unwrap_or(std::usize::MAX);
+        let mut records_to_read = self.schema_infer_max_rec.unwrap_or(1000);
         while let Some(obj_reader) = readers.next().await {
             let mut reader = obj_reader?.sync_reader()?;
```
I suggest we change the default value of `schema_infer_max_rec` to something
sensible like 100 or 1000. I think it is exceedingly rare to need to use more
than this.
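To illustrate why a cap like this bounds the cost, here is a minimal, self-contained sketch of capped type inference over CSV-like rows. The names (`infer_column_types`, `InferredType`) and the type-widening logic are illustrative assumptions, not DataFusion's actual implementation:

```rust
// Illustrative sketch only: infer a type per column from at most
// `max_records` rows, widening Int64 -> Float64 -> Utf8 as needed.
#[derive(Debug, PartialEq, Clone, Copy)]
enum InferredType {
    Int64,
    Float64,
    Utf8,
}

fn infer_column_types(rows: &[Vec<&str>], max_records: usize) -> Vec<InferredType> {
    let ncols = rows.first().map_or(0, |r| r.len());
    // Start with the narrowest type and widen as counterexamples appear.
    let mut types = vec![InferredType::Int64; ncols];
    // Only the first `max_records` rows are examined, so inference cost
    // is bounded regardless of how large the file is.
    for row in rows.iter().take(max_records) {
        for (i, field) in row.iter().enumerate() {
            types[i] = match types[i] {
                InferredType::Int64 if field.parse::<i64>().is_ok() => InferredType::Int64,
                InferredType::Int64 | InferredType::Float64
                    if field.parse::<f64>().is_ok() =>
                {
                    InferredType::Float64
                }
                _ => InferredType::Utf8,
            };
        }
    }
    types
}

fn main() {
    let rows = vec![
        vec!["1", "1.5", "a"],
        vec!["2", "3", "b"],
    ];
    // prints [Int64, Float64, Utf8]
    println!("{:?}", infer_column_types(&rows, 1000));
}
```

With the cap at 1000 (as in the diff above), inference scans at most 1000 records per file instead of every row, which is why the timing drops so dramatically on large files.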
FYI @jychen7 if you are looking for good candidates for changes to backport
for a 7.1 type release, this would be one :)