thisisnic commented on issue #47731: URL: https://github.com/apache/arrow/issues/47731#issuecomment-3411003024
Thanks for the question @m0byn! When reading CSV files, Arrow infers the most appropriate data type for each column, trying the following in order:

- Null
- Int64
- Boolean
- Date32
- Time32 (with seconds unit)
- Timestamp (with seconds unit)
- Timestamp (with nanoseconds unit)
- Float64
- Dictionary<String> (if `ConvertOptions::auto_dict_encode` is true)
- Dictionary<Binary> (if `ConvertOptions::auto_dict_encode` is true)
- String
- Binary

In addition, when Arrow's C++ CSV reader infers types, a column is inferred as string if its values are valid UTF-8 sequences, and as binary if they are not.

If the concern is whether these columns contain valid UTF-8 sequences, that may be something to handle upstream. If the concern is just reading the data in as intended, you can manually set the column to string by specifying the schema. More info in this book chapter here: https://arrowrbook.com/files_and_formats.html#sec-schemas

There's a reprex below that shows how to both set the schema and skip checking whether values are valid UTF-8. Let me know if you have any other questions!

``` r
library(arrow)

tf <- tempfile(fileext = ".csv")

# Two lines: header 'x' and value 'cafe'
writeLines(c("x", "cafe"), tf, useBytes = TRUE)

# Read raw and replace the 'e' in the data cell with 0xE9
raw <- readBin(tf, "raw", file.size(tf))
nl <- which(raw == as.raw(0x0A))[1]  # first LF
val_start <- nl + 1                  # start of "cafe"
raw[val_start + 3] <- as.raw(0xE9)   # 'e' -> 0xE9
writeBin(raw, tf)

# Verify bytes
readLines(tf)
#> [1] "x"       "caf\xe9"

# does weird things with bytes
arrow::read_delim_arrow(tf)
#> x
#> 1 63, 61, 66, e9

# ah, because it's binary
arrow::read_delim_arrow(tf, as_data_frame = FALSE)
#> Table
#> 1 rows x 1 columns
#> $x <binary>

# now we get an error
arrow::read_delim_arrow(tf, schema = schema(x = string()), skip = 1)
#> Error in `arrow::read_delim_arrow()`:
#> ! Invalid: In CSV column #0: Row #2: CSV conversion error to string: invalid UTF8 data
#> ℹ If you have supplied a schema and your data contains a header row, you should supply the argument `skip = 1` to prevent the header being read in as data.

# if we don't care about checking, we can get it to import it as-is
arrow::read_delim_arrow(
  tf,
  schema = schema(x = string()),
  skip = 1,
  convert_options = csv_convert_options(check_utf8 = FALSE)
)
#>         x
#> 1 caf\xe9
```
