thisisnic commented on issue #47731:
URL: https://github.com/apache/arrow/issues/47731#issuecomment-3411003024

   Thanks for the question @m0byn!
   
   When reading CSV files, Arrow infers the most appropriate data type for each 
column in the following order:
   
   - Null
   - Int64
   - Boolean
   - Date32
   - Time32 (with seconds unit)
   - Timestamp (with seconds unit)
   - Timestamp (with nanoseconds unit)
   - Float64
   - Dictionary<String> (if ConvertOptions::auto_dict_encode is true)
   - Dictionary<Binary> (if ConvertOptions::auto_dict_encode is true)
   - String
   - Binary
   
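You can see this inference order in action by reading a small CSV without a schema. A minimal sketch (the inferred types shown assume Arrow's default `ConvertOptions`, i.e. no auto dict-encoding):

``` r
library(arrow)

# Write a small CSV with one integer-like, one boolean-like,
# and one text column
tf2 <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,true,hello", "2,false,world"), tf2)

# Read without a schema so Arrow infers each column's type;
# as_data_frame = FALSE returns the Arrow Table so the types are visible
tbl <- read_delim_arrow(tf2, as_data_frame = FALSE)
tbl$schema
# expect a: int64, b: bool, c: string under default options
```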
   In addition, when Arrow's C++ CSV reader infers types, a column is inferred 
as string if all of its values are valid UTF-8 sequences, and as binary if any 
of them are not.
   
   If the concern is whether these columns contain valid UTF-8 sequences, that 
may be something to handle upstream, before the data reaches Arrow.
   
   If the concern is just reading the data in as intended, you can force the 
column to be read as string by specifying the schema manually. There's more 
info in this book chapter: https://arrowrbook.com/files_and_formats.html#sec-schemas
   
   The reprex below shows how to both set the schema and skip checking whether 
values are valid UTF-8.
   
   Let me know if you have any other questions!
   
   ``` r
   library(arrow)
   
   tf <- tempfile(fileext = ".csv")
   # Two lines: header 'x' and value 'cafe'
   writeLines(c("x", "cafe"), tf, useBytes = TRUE)
   
   # Read raw and replace the 'e' in the data cell with 0xE9
   raw <- readBin(tf, "raw", file.size(tf))
   nl <- which(raw == as.raw(0x0A))[1]           # first LF
   val_start <- nl + 1                           # start of "cafe"
   raw[val_start + 3] <- as.raw(0xE9)            # 'e' -> 0xE9
   writeBin(raw, tf)
   
   # Verify bytes
   readLines(tf)
   #> [1] "x"       "caf\xe9"
   
   # the column comes back as raw bytes
   arrow::read_delim_arrow(tf)
   #>                x
   #> 1 63, 61, 66, e9
   
   # ah, because it's binary
   arrow::read_delim_arrow(tf, as_data_frame = FALSE)
   #> Table
   #> 1 rows x 1 columns
   #> $x <binary>
   
   # now we get an error 
   arrow::read_delim_arrow(tf, schema = schema(x = string()), skip = 1)
   #> Error in `arrow::read_delim_arrow()`:
   #> ! Invalid: In CSV column #0: Row #2: CSV conversion error to string: 
invalid UTF8 data
   #> ℹ If you have supplied a schema and your data contains a header row, you 
should supply the argument `skip = 1` to prevent the header being read in as 
data.
   
   # if we don't care about checking, we can get it to import it as-is
   arrow::read_delim_arrow(
     tf,
     schema = schema(x = string()),
     skip = 1,
     convert_options = csv_convert_options(check_utf8 = FALSE)
   )
   #>         x
   #> 1 caf\xe9
   ```

