alamb commented on issue #2109:
URL:
https://github.com/apache/arrow-datafusion/issues/2109#issuecomment-1083477675
I think a change in the DF 7.0 line made the number of lines used to infer
the schema configurable, and the default changed to "the whole file".
Thus, in 7.0 the datafusion-cli appears to be parsing the entire CSV file
just to do schema inference.
When I applied the following diff, the time went from **131.012 seconds**
locally to **0.076 seconds**.
```diff
diff --git a/datafusion/core/src/datasource/file_format/csv.rs b/datafusion/core/src/datasource/file_format/csv.rs
index 29ca84a12..c0a6307e8 100644
--- a/datafusion/core/src/datasource/file_format/csv.rs
+++ b/datafusion/core/src/datasource/file_format/csv.rs
@@ -95,7 +95,7 @@ impl FileFormat for CsvFormat {
     async fn infer_schema(&self, mut readers: ObjectReaderStream) -> Result<SchemaRef> {
         let mut schemas = vec![];
-        let mut records_to_read = self.schema_infer_max_rec.unwrap_or(std::usize::MAX);
+        let mut records_to_read = self.schema_infer_max_rec.unwrap_or(1000);
         while let Some(obj_reader) = readers.next().await {
             let mut reader = obj_reader?.sync_reader()?;
```
I suggest we change the default value of `schema_infer_max_rec` to something
sensible like 100 or 1000. I think it is exceedingly rare to need to use more
than this.
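To illustrate why a cap like this bounds the cost, here is a minimal, self-contained sketch of capped type inference over CSV-like rows. The names (`infer_column_types`, `InferredType`) and the type-widening logic are illustrative assumptions, not DataFusion's actual implementation:

```rust
// Illustrative sketch only: infer a type per column from at most
// `max_records` rows, widening Int64 -> Float64 -> Utf8 as needed.
#[derive(Debug, PartialEq, Clone, Copy)]
enum InferredType {
    Int64,
    Float64,
    Utf8,
}

fn infer_column_types(rows: &[Vec<&str>], max_records: usize) -> Vec<InferredType> {
    let ncols = rows.first().map_or(0, |r| r.len());
    // Start with the narrowest type and widen as counterexamples appear.
    let mut types = vec![InferredType::Int64; ncols];
    // Only the first `max_records` rows are examined, so inference cost
    // is bounded regardless of how large the file is.
    for row in rows.iter().take(max_records) {
        for (i, field) in row.iter().enumerate() {
            types[i] = match types[i] {
                InferredType::Int64 if field.parse::<i64>().is_ok() => InferredType::Int64,
                InferredType::Int64 | InferredType::Float64
                    if field.parse::<f64>().is_ok() =>
                {
                    InferredType::Float64
                }
                _ => InferredType::Utf8,
            };
        }
    }
    types
}

fn main() {
    let rows = vec![
        vec!["1", "1.5", "a"],
        vec!["2", "3", "b"],
    ];
    // prints [Int64, Float64, Utf8]
    println!("{:?}", infer_column_types(&rows, 1000));
}
```

With the cap at 1000 (as in the diff above), inference scans at most 1000 records per file instead of every row, which is why the timing drops so dramatically on large files.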
FYI @jychen7 if you are looking for good candidates for changes to backport
for a 7.1 type release, this would be one :)