Gregoire Leleu created ARROW-15992:
--------------------------------------

             Summary: [R] csv file encoding working for one file, but not a 
folder of files
                 Key: ARROW-15992
                 URL: https://issues.apache.org/jira/browse/ARROW-15992
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
            Reporter: Gregoire Leleu
         Attachments: Test1.txt

The encoding options are passed when a single file is read with 
read_delim_arrow, but not when opening a folder with open_dataset.

read_delim_arrow creates a reader using CsvTableReader$create (which is what is 
tested in the package's tests).

open_dataset creates a factory and I'm unable to follow what happens when 
$Finish() is called.

 

Also, the documentation ("CsvReadOptions" page) lists the "encoding" option 
under "CsvConvertOptions$create()" instead of "CsvReadOptions$create()".

 
{code:r}
library(dplyr)
library(arrow)
# Opens one file just fine:
one_file <- arrow::read_delim_arrow(
  "test/Test1.txt", 
  as_data_frame = FALSE,
  delim = ";",
  read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
)
collect(one_file)
 
# Can't open the folder that contains "Test1.txt" properly: Column2 ends up
# typed as binary
one_folder <- arrow::open_dataset(
  "test", 
  delim = ";",
  read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
)
collect(one_folder)
 
# Even when specifying the schema
one_folder_w_schema <- arrow::open_dataset(
  "test", 
  schema = Schema$create(Column1 = string(), Column2 = string()),
  format = FileFormat$create(
    "text",
    skip_rows = 1L,
    delimiter = ";",
    column_names = c("Column1", "Column2"),
    read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
  )
)
collect(one_folder_w_schema)
{code}
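
A possible workaround while open_dataset does not apply the encoding option is 
to transcode the files to UTF-8 first and open the converted copies. The sketch 
below uses only base R. The helper name transcode_to_utf8 and the output folder 
are hypothetical, it assumes a flat folder of ISO-8859-1 files small enough to 
load into memory, and it has not been run against the attached Test1.txt.

{code:r}
# Hypothetical helper (not part of arrow): copy every file in `src_dir` into
# `dst_dir`, converting its bytes from the `from` encoding to UTF-8.
transcode_to_utf8 <- function(src_dir, dst_dir, from = "ISO-8859-1") {
  dir.create(dst_dir, showWarnings = FALSE, recursive = TRUE)
  paths <- list.files(src_dir, full.names = TRUE)
  paths <- paths[!dir.exists(paths)]  # skip sub-directories
  for (path in paths) {
    # Read the raw bytes so the result does not depend on the session locale
    bytes <- readBin(path, what = "raw", n = file.size(path))
    text <- iconv(list(bytes), from = from, to = "UTF-8")
    if (is.na(text)) stop("could not convert ", path, " from ", from)
    # charToRaw() on the UTF-8 string yields the UTF-8 bytes to write out
    writeBin(charToRaw(text), file.path(dst_dir, basename(path)))
  }
  invisible(dst_dir)
}

# Assumed usage; the folder name "test_utf8" is illustrative only:
# utf8_dir <- transcode_to_utf8("test", file.path(tempdir(), "test_utf8"))
# one_folder_utf8 <- arrow::open_dataset(utf8_dir, format = "text", delim = ";")
# collect(one_folder_utf8)
{code}

This sidesteps the problem rather than fixing it: the encoding option on 
CsvReadOptions is still not applied on the dataset path.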
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
