paleolimbot opened a new pull request #12030:
URL: https://github.com/apache/arrow/pull/12030
This PR makes it possible to read non-UTF-8-encoded CSV files, as was done in
Python (ARROW-9106). I'm very open to (and would love suggestions on!) changes
in the structure, naming, and implementation, since C++ isn't my strong suit. I
opted for R's C-level iconv because it made more sense to me than calling
back into R, where I don't know how I'd handle a partial multibyte character at
the end of a buffer.
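
To illustrate the buffer-boundary problem mentioned above, here is a minimal R-level sketch, not the code in this PR (which goes through R's C-level iconv). It transcodes UTF-16LE bytes chunk by chunk and holds back any trailing bytes that do not yet form a complete character. The helpers `utf16le_holdback()` and `transcode_chunks()` are hypothetical names made up for this example.

```r
# Hypothetical helper: how many trailing bytes of a UTF-16LE chunk must be
# held back because they do not yet form a complete character.
utf16le_holdback <- function(bytes) {
  n <- length(bytes)
  hold <- n %% 2L                      # dangling half of a 2-byte code unit
  m <- n - hold
  if (m >= 2L) {
    hi <- as.integer(bytes[m])         # high byte of the last full code unit
    # 0xD8..0xDB marks a high surrogate that still needs its low surrogate
    if (hi >= 0xD8 && hi <= 0xDB) hold <- hold + 2L
  }
  hold
}

# Hypothetical helper: transcode UTF-16LE bytes to UTF-8 in fixed-size
# chunks, carrying incomplete trailing bytes over to the next chunk.
transcode_chunks <- function(bytes, chunk_size) {
  out <- character()
  carry <- raw()
  for (start in seq(1L, length(bytes), by = chunk_size)) {
    end <- min(start + chunk_size - 1L, length(bytes))
    chunk <- c(carry, bytes[start:end])
    hold <- utf16le_holdback(chunk)
    keep <- length(chunk) - hold
    carry <- chunk[keep + seq_len(hold)]
    if (keep > 0L) {
      piece <- iconv(list(chunk[seq_len(keep)]), from = "UTF-16LE", to = "UTF-8")
      out <- c(out, piece)
    }
  }
  if (length(carry) > 0L) stop("input ends in the middle of a character")
  paste(out, collapse = "")
}

text <- "a,\u00e9,\U0001f4a9\n"
bytes <- iconv(text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]

# Odd chunk sizes force splits inside code units and surrogate pairs.
result3 <- transcode_chunks(bytes, chunk_size = 3L)
result5 <- transcode_chunks(bytes, chunk_size = 5L)
```

Doing this bookkeeping by hand is encoding-specific, which is part of why letting iconv itself report incomplete input at the C level is attractive.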
Reprex for testing:
``` r
library(arrow, warn.conflicts = FALSE)
tf <- tempfile()
on.exit(unlink(tf))
strings <- c("a", "\u00e9", "\U0001f4a9", NA)
file_string <- paste0(
  "col1,col2\n",
  paste(strings, 1:400, sep = ",", collapse = "\n")
)
file_bytes_utf16 <- iconv(file_string, to = "UTF-16LE", toRaw = TRUE)[[1]]
con <- file(tf, open = "wb")
writeBin(file_bytes_utf16, con)
close(con)
fs <- LocalFileSystem$create()
reader <- CsvTableReader$create(
  fs$OpenInputStream(tf),
  read_options = CsvReadOptions$create(encoding = "UTF-16LE")
)
tibble::as_tibble(reader$Read())
#> # A tibble: 400 × 2
#>    col1   col2
#>    <chr> <int>
#>  1 a         1
#>  2 é         2
#>  3 💩        3
#>  4 NA        4
#>  5 a         5
#>  6 é         6
#>  7 💩        7
#>  8 NA        8
#>  9 a         9
#> 10 é        10
#> # … with 390 more rows
```