kolulu23 opened a new issue, #13087: URL: https://github.com/apache/datafusion/issues/13087
### Describe the bug

CsvFormat `infer_schema` reports an `UnequalLengths` error even though quote and escape characters are set in its options. This would surprise users, because `SessionContext::register_csv` accepts `CsvReadOptions` but `infer_schema` does not fully use them.

### To Reproduce

For this csv file `test.csv`:

```csv
c1,c2,c3,c4
2166.105475712115,")8P~f(Je/+\",@pV<",g$vGzWhTxeZzXc!{,0
```

Note that some columns are quoted with `"` and contain the escape character `\`.

This test fails:

```rust
#[cfg(test)]
mod test {
    use datafusion::error::DataFusionError;
    use datafusion::prelude::{CsvReadOptions, SessionContext};

    #[tokio::test]
    async fn infer_schema_failure() {
        let ctx = SessionContext::new();
        let r = ctx
            .register_csv(
                "test",
                "test.csv",
                CsvReadOptions::new()
                    .has_header(true)
                    .quote(b'"')
                    .escape(b'\\'),
            )
            .await;
        assert!(r.is_ok());
    }
}
```

The error is `Encountered unequal lengths between records on CSV file whilst inferring schema. Expected 4 records, found 5 records`.

### Expected behavior

`register_csv` should not return `Err`, because `CsvReadOptions` specifies the header, quote, and escape character. The underlying csv reader should use these options when inferring the schema.

### Additional context

If a schema that matches `test.csv` is provided to `CsvReadOptions`, the test passes and the csv table can be used.

After some debugging, I found that the creation of `arrow::csv::reader::Format` in `CsvFormat::infer_schema_from_stream` does not use the quote and escape settings in `CsvFormat`, which seems odd to me.

https://github.com/apache/datafusion/blob/f2da32b3bde851c34e9df0a2f4c174a5392f8897/datafusion/core/src/datasource/file_format/csv.rs#L440-L456

I dug further into the `arrow-csv` and `csv` crates, and the quote and escape options are all there. I think that if the right options were passed through, `infer_schema` would be easier to use.
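The "expected 4, found 5" mismatch in the error can be reproduced by hand. The sketch below is only an illustration under a deliberately simplified quoting model: `count_fields` and `RECORD` are hypothetical names introduced here, and this is not the state machine used by the `csv` or `arrow-csv` crates. It counts the fields of the data row from `test.csv` with and without an escape character.

```rust
/// The data row from `test.csv`, kept verbatim in a raw string.
const RECORD: &str = r#"2166.105475712115,")8P~f(Je/+\",@pV<",g$vGzWhTxeZzXc!{,0"#;

/// Count the fields of a single CSV record.
///
/// Simplified model (an assumption, not the real parser): a quote only
/// opens a quoted field at the start of a field, and an escape byte is
/// only honored inside a quoted field.
fn count_fields(line: &str, delimiter: u8, quote: u8, escape: Option<u8>) -> usize {
    let bytes = line.as_bytes();
    let mut fields = 1;
    let mut in_quotes = false;
    let mut at_field_start = true;
    let mut i = 0;
    while i < bytes.len() {
        let b = bytes[i];
        if in_quotes {
            if Some(b) == escape && i + 1 < bytes.len() {
                // Skip the escaped byte so an escaped quote cannot close the field.
                i += 2;
                continue;
            }
            if b == quote {
                in_quotes = false;
            }
        } else if b == delimiter {
            // A delimiter outside quotes starts a new field.
            fields += 1;
            at_field_start = true;
            i += 1;
            continue;
        } else if b == quote && at_field_start {
            in_quotes = true;
        }
        at_field_start = false;
        i += 1;
    }
    fields
}
```

Under this model, `count_fields(RECORD, b',', b'"', None)` sees the `"` after `\` as the closing quote and splits the second column in two, while `count_fields(RECORD, b',', b'"', Some(b'\\'))` treats `\"` as a literal quote and keeps the column whole. That is consistent with schema inference running without the escape setting.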
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.