paleolimbot opened a new pull request #12030:
URL: https://github.com/apache/arrow/pull/12030


   This PR makes it possible to read non-UTF-8-encoded CSV files, as was done 
for Python (ARROW-9106). I'm very open to (and would love suggestions on!) 
changes in the structure, naming, and implementation, since C++ isn't my strong 
suit. I opted for R's C-level iconv because it made more sense to me than 
calling back into R, where I don't know how I'd handle a partial multibyte 
character at the end of a buffer.
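
To illustrate the partial-character concern, here is a minimal sketch of chunked transcoding. It is not the code in this PR: it assumes plain POSIX `iconv()` (glibc-style signature) rather than R's wrapper, and `ChunkTranscoder` and `Convert` are hypothetical names. When `iconv()` stops with `EINVAL`, the input ends in an incomplete multibyte sequence, so the sketch keeps those bytes and prepends them to the next chunk.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the PR): converts successive byte chunks to
// UTF-8 and carries an incomplete trailing sequence over to the next call.
class ChunkTranscoder {
 public:
  explicit ChunkTranscoder(const char* from_encoding)
      : cd_(iconv_open("UTF-8", from_encoding)) {
    if (cd_ == reinterpret_cast<iconv_t>(-1)) {
      throw std::runtime_error("unsupported encoding");
    }
  }
  ~ChunkTranscoder() { iconv_close(cd_); }
  ChunkTranscoder(const ChunkTranscoder&) = delete;
  ChunkTranscoder& operator=(const ChunkTranscoder&) = delete;

  // Returns UTF-8 for every complete character available so far.
  std::string Convert(const std::string& chunk) {
    assert(cd_ != reinterpret_cast<iconv_t>(-1));
    std::string input = pending_ + chunk;  // leftover bytes come first
    pending_.clear();

    std::string output;
    char* in = &input[0];
    std::size_t in_left = input.size();
    char buffer[256];

    while (in_left > 0) {
      char* out = buffer;
      std::size_t out_left = sizeof(buffer);
      std::size_t rc = iconv(cd_, &in, &in_left, &out, &out_left);
      int err = errno;  // save before any other call can overwrite it
      output.append(buffer, sizeof(buffer) - out_left);
      if (rc != static_cast<std::size_t>(-1)) break;  // all input consumed
      if (err == E2BIG) continue;  // output buffer full: convert the rest
      if (err == EINVAL) {         // incomplete sequence at end of input
        pending_.assign(in, in_left);
        break;
      }
      throw std::runtime_error("invalid byte sequence in input");
    }
    return output;
  }

 private:
  iconv_t cd_;
  std::string pending_;  // bytes of a character split across chunks
};
```

With a `ChunkTranscoder` opened for `"UTF-16LE"`, a chunk that ends in the first half of a surrogate pair yields only the characters before it, and the emoji appears once the following chunk supplies the second half.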
   
   Reprex for testing:
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   
   tf <- tempfile()
   on.exit(unlink(tf))
   
   strings <- c("a", "\u00e9", "\U0001f4a9", NA)
   file_string <- paste0(
     "col1,col2\n",
     paste(strings, 1:400, sep = ",", collapse = "\n")
   )
   
   file_bytes_utf16 <- iconv(file_string, to = "UTF-16LE", toRaw = TRUE)[[1]]
   
   con <- file(tf, open = "wb")
   writeBin(file_bytes_utf16, con)
   close(con)
   
   fs <- LocalFileSystem$create()
   reader <- CsvTableReader$create(
     fs$OpenInputStream(tf),
     read_options = CsvReadOptions$create(encoding = "UTF-16LE")
   )
   
   tibble::as_tibble(reader$Read())
   #> # A tibble: 400 × 2
   #>    col1   col2
   #>    <chr> <int>
   #>  1 a         1
   #>  2 é         2
   #>  3 💩        3
   #>  4 NA        4
   #>  5 a         5
   #>  6 é         6
   #>  7 💩        7
   #>  8 NA        8
   #>  9 a         9
   #> 10 é        10
   #> # … with 390 more rows
   ```
   

