[GitHub] [arrow] jonkeane commented on a change in pull request #12030: ARROW-9186: [R] Allow specifying CSV file encoding

GitBox Wed, 05 Jan 2022 13:01:30 -0800


jonkeane commented on a change in pull request #12030:
URL: https://github.com/apache/arrow/pull/12030#discussion_r779133978




##########
File path: r/tests/testthat/test-csv.R
##########
@@ -290,6 +292,50 @@ test_that("more informative error when reading a CSV with headers and schema", {
   )
 })
 
+test_that("CSV reader works on files with non-UTF-8 encoding", {
+  strings <- c("a", "\u00e9", "\U0001f4a9")
+  file_string <- paste0(
+    "col1,col2\n",
+    paste(strings, 1:30, sep = ",", collapse = "\n")
+  )
+  file_bytes_utf16 <- iconv(
+    file_string,
+    from = Encoding(file_string),
+    to = "UTF-16LE",
+    toRaw = TRUE
+  )[[1]]
+
+  tf <- tempfile()
+  on.exit(unlink(tf))
+  con <- file(tf, open = "wb")
+  writeBin(file_bytes_utf16, con)
+  close(con)
+
+  fs <- LocalFileSystem$create()
+  reader <- CsvTableReader$create(
+    fs$OpenInputStream(tf),
+    read_options = CsvReadOptions$create(encoding = "UTF-16LE")
+  )

Review comment:
       _nods_ I was wondering more whether there is any sort of error or
detection when a file is read without its encoding being specified. Altering
your reprex below slightly, I see that we now get binary columns out. That's
not the worst outcome (and there's probably no reliable way to detect this and
do something different anyway):
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   
   # generate a data frame with funky characters
   latin1_chars <- iconv(
     # exclude the comma and control characters
     list(as.raw(setdiff(c(38:126, 161:255), 44))),
     "latin1", "UTF-8"
   )
   
   make_text_col <- function(chars, 
                             chars_per_item_min = 1, chars_per_item_max = 20,
                             n_items = 20) {
     choices <- unlist(strsplit(chars, ""))
     text_col <- character(n_items)
     for (i in seq_along(text_col)) {
       text_col[i] <- paste0(
         sample(
           choices, 
           round(runif(1, chars_per_item_min, chars_per_item_max)), 
           replace = TRUE
         ),
         collapse = ""
       )
     }
     text_col
   }
   
   set.seed(1843)
   n_items <- 1e6
   
   df_latin1 <- data.frame(
     n = 1:n_items,
     latin1_chars = make_text_col(latin1_chars, n_items = n_items)
   )
   
   # now check the CSV reader
   
   # make some files
   tf_latin1_utf8 <- tempfile()
   tf_latin1_latin1 <- tempfile()
   
   readr::write_csv(df_latin1, tf_latin1_utf8)
   readr::write_file(
     iconv(
       list(readr::read_file_raw(tf_latin1_utf8)),
       "UTF-8", "latin1",
       toRaw = TRUE
     )[[1]],
     tf_latin1_latin1
   )
   
   fs <- LocalFileSystem$create()
   reader <- CsvTableReader$create(
     fs$OpenInputStream(tf_latin1_latin1)
   )
   tibble::as_tibble(reader$Read())
   #> # A tibble: 1,000,000 × 2
   #>        n                                                           latin1_chars
   #>    <int>                                                             <arrw_bnr>
   #>  1     1                             be, 31, 4f, e3, d8, d4, 5c, f9, e9, 76, cd
   #>  2     2                                                 5c, ad, bf, ed, 62, dd
   #>  3     3             fe, 63, ec, 48, c7, 37, 45, e1, 71, 6b, 77, ca, a4, a6, 3b
   #>  4     4                     47, 2f, 67, cc, a9, e3, 51, b0, 38, 52, f8, 74, f3
   #>  5     5                                                                     48
   #>  6     6 7c, 47, 50, f4, e5, 49, cc, e3, 65, b7, 64, 61, b7, 64, 5d, 7a, 51, a1
   #>  7     7                                                     39, f8, f9, c4, 70
   #>  8     8                                                         4f, 78, 65, b1
   #>  9     9                                                 fa, 71, 65, ff, ed, ca
   #> 10    10                     6f, 26, f9, b8, 69, c9, 42, 64, a8, 39, 77, 7d, 58
   #> # … with 999,990 more rows
   ```
   
   <sup>Created on 2022-01-05 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
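
On the detection question raised above, a rough pre-check is possible with base R alone: `validUTF8()` can flag input whose bytes are not valid UTF-8 before it reaches the reader. This is only a heuristic sketch under a few assumptions. `file_is_valid_utf8()` is a hypothetical helper, not part of arrow; it assumes a line-oriented file without embedded NULs (so not UTF-16); and it cannot tell which encoding a file actually uses, only that it is not UTF-8.

```r
# Hypothetical helper (not part of arrow): TRUE if every line of the file
# is valid UTF-8. Assumes the file has no embedded NULs (i.e. not UTF-16).
file_is_valid_utf8 <- function(path) {
  lines <- readLines(path, warn = FALSE)
  all(validUTF8(lines))
}

# Two tiny CSV files holding the same text; only the bytes for "é" differ.
tf_utf8 <- tempfile()
tf_latin1 <- tempfile()
prefix <- charToRaw("n,txt\n1,caf")
newline <- charToRaw("\n")
writeBin(c(prefix, as.raw(c(0xc3, 0xa9)), newline), tf_utf8)  # "é" as UTF-8
writeBin(c(prefix, as.raw(0xe9), newline), tf_latin1)         # "é" as latin1

utf8_ok <- file_is_valid_utf8(tf_utf8)
latin1_ok <- file_is_valid_utf8(tf_latin1)

# If the source encoding is known (here assumed to be latin1), the lines can
# be re-encoded to UTF-8 with iconv() before any further parsing.
fixed <- iconv(readLines(tf_latin1, warn = FALSE), from = "latin1", to = "UTF-8")

unlink(c(tf_utf8, tf_latin1))
```

A file that passes this check is not guaranteed to be UTF-8 in intent (pure ASCII content is valid under many encodings), which is in line with the point above that reliable detection is probably out of reach.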




