kolulu23 opened a new issue, #13087:
URL: https://github.com/apache/datafusion/issues/13087

   ### Describe the bug
   
   `CsvFormat::infer_schema` fails with an `UnequalLengths` error even when the 
quote and escape characters are set in its options.
   
   This is surprising because `SessionContext::register_csv` accepts 
`CsvReadOptions`, but `infer_schema` does not use all of those options.
   
   ### To Reproduce
   
   For this csv file `test.csv`:
   ```csv
   c1,c2,c3,c4
   2166.105475712115,")8P~f(Je/+\",@pV<",g$vGzWhTxeZzXc!{,0
   ```
   
   Note that the `c2` value is quoted with `"` and contains a quote escaped 
with `\`.
   
   This test would fail:
   
   ```rust
   #[cfg(test)]
   mod test {
       use datafusion::error::DataFusionError;
       use datafusion::prelude::{CsvReadOptions, SessionContext};
   
       #[tokio::test]
       async fn infer_schema_failure() {
           let ctx = SessionContext::new();
           let r = ctx
               .register_csv(
                   "test",
                   "test.csv",
                   CsvReadOptions::new()
                       .has_header(true)
                       .quote(b'"')
                       .escape(b'\\'),
               )
               .await;
           assert!(r.is_ok());
       }
   }
   ```
   
   The error is `Encountered unequal lengths between records on CSV file whilst 
inferring schema. Expected 4 records, found 5 records`.
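
   The count of 5 can be reproduced by hand. The sketch below is a deliberately 
simplified tokenizer written only for illustration; `count_fields` is a 
hypothetical helper, not the parser used by the `csv` crate, and it assumes a 
`,` delimiter and lenient handling of stray quotes in unquoted fields.

```rust
/// The data row from `test.csv`.
const ROW: &str = r#"2166.105475712115,")8P~f(Je/+\",@pV<",g$vGzWhTxeZzXc!{,0"#;

/// Count the fields in one CSV line (hypothetical, simplified tokenizer).
/// `escape` is the optional escape character honored inside quoted fields.
fn count_fields(line: &str, quote: char, escape: Option<char>) -> usize {
    let mut fields = 1;
    let mut in_quotes = false;
    let mut at_field_start = true;
    let mut chars = line.chars().peekable();

    while let Some(c) = chars.next() {
        if in_quotes {
            if Some(c) == escape {
                // Escape character: the next character is taken literally.
                chars.next();
            } else if c == quote {
                if chars.peek() == Some(&quote) {
                    // A doubled quote inside a quoted field is a literal quote.
                    chars.next();
                } else {
                    // Closing quote.
                    in_quotes = false;
                }
            }
        } else if c == ',' {
            // Delimiter outside quotes starts a new field.
            fields += 1;
            at_field_start = true;
            continue;
        } else if c == quote && at_field_start {
            // A quote only opens a quoted field at the start of the field;
            // elsewhere it is kept as an ordinary character.
            in_quotes = true;
        }
        at_field_start = false;
    }
    fields
}
```

   With `escape` set to `None`, the `\"` closes the quoted field early and 
`count_fields(ROW, '"', None)` returns 5, which matches the error. With 
`Some('\\')` it returns 4, which matches the header.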
   
   ### Expected behavior
   
   `register_csv` should not return `Err`, because `CsvReadOptions` specifies 
the header, quote, and escape character.
   
   The underlying CSV reader should use these options when inferring the schema.
   
   ### Additional context
   
   If a schema that matches `test.csv` is provided to `CsvReadOptions`, the 
test passes and the CSV table can be used.
   
   After some debugging, I found that `CsvFormat::infer_schema_from_stream` 
creates its `arrow::csv::reader::Format` without applying the quote and escape 
settings from `CsvFormat`, which seems odd to me.
   
   
https://github.com/apache/datafusion/blob/f2da32b3bde851c34e9df0a2f4c174a5392f8897/datafusion/core/src/datasource/file_format/csv.rs#L440-L456
   
   I dug further into the `arrow-csv` and `csv` crates, and the quote and 
escape options are all there. If the right options were passed through, 
`infer_schema` would be easier to use.
   

