alexkreidler opened a new issue, #3324:
URL: https://github.com/apache/arrow-rs/issues/3324

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   I'm writing some code to infer types from a bunch of different CSV files. I 
ran it on one and got this error:
   ```
   Sampled 100000 lines from ./backup/Book-0.tsv
   Error: Csv error: Encountered UTF-8 error while reading CSV file: invalid 
utf-8: invalid UTF-8 in field 4 near byte index 149
   ```
   because the file contained the string `serving as the navy�s liaison`, where the apostrophe is stored as a byte sequence that is not valid UTF-8.
   
   **Describe the solution you'd like**
   I'd like to be able to set an additional field on the `ReaderOptions` 
struct passed to `infer_reader_schema_with_csv_options` (or, better yet, an 
option on the `infer_reader_schema` function) so that the library silently 
continues past non-UTF-8 values. The inferred type for such a column could 
still be `DataType::Utf8`, or `DataType::Null`, or, even better, 
`DataType::Binary`.
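
To make the request concrete, here is a rough sketch of the fallback I have in mind, using only the standard library. The `InferredType` enum and the `infer_field` / `infer_column` helpers are hypothetical names for illustration, not part of `arrow-csv`, and the `0x92` byte is my guess at what is in the file (a Windows-1252 right single quote).

```rust
/// Hypothetical stand-in for the column types mentioned above;
/// this is NOT arrow's `DataType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum InferredType {
    Null,
    Utf8,
    Binary,
}

/// Classify one raw field without returning an error on invalid UTF-8.
fn infer_field(raw: &[u8]) -> InferredType {
    if raw.is_empty() {
        InferredType::Null
    } else if std::str::from_utf8(raw).is_ok() {
        InferredType::Utf8
    } else {
        // Invalid UTF-8: fall back to a binary column instead of failing.
        InferredType::Binary
    }
}

/// Merge the per-row results for one column:
/// any Binary field wins, then Utf8, otherwise Null.
fn infer_column(fields: &[&[u8]]) -> InferredType {
    fields
        .iter()
        .map(|f| infer_field(f))
        .fold(InferredType::Null, |acc, t| match (acc, t) {
            (InferredType::Binary, _) | (_, InferredType::Binary) => InferredType::Binary,
            (InferredType::Utf8, _) | (_, InferredType::Utf8) => InferredType::Utf8,
            _ => InferredType::Null,
        })
}

fn main() {
    // Assumed bytes: 0x92 on its own is not a valid UTF-8 sequence.
    let bad: &[u8] = b"serving as the navy\x92s liaison";
    let good: &[u8] = b"plain ascii";

    println!("{:?}", infer_column(&[good, bad]));
    // A lossy conversion is another option if the column should stay Utf8.
    println!("{}", String::from_utf8_lossy(bad));
}
```

The real inference code also has to distinguish numeric, boolean and date columns, so this only illustrates the "don't fail, fall back to binary" part of the request.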
   
   **Describe alternatives you've considered**
   I could handle this in my own code, but I imagine many users have 
non-UTF-8 CSVs and would still like to pass the data verbatim through Apache 
Arrow.
   
   **Additional context**
   Diving into the code, it looks like we'd need to use `read_byte_record` 
instead of `read_record` in the code linked below. I'm not sure how extensive 
the required changes to the `arrow-csv` crate would be.
   
   
https://github.com/apache/arrow-rs/blob/9e39f96b121d88b7427295bd326d14bb78d0fb39/arrow-csv/src/reader.rs#L487-L499
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
