Gregoire Leleu created ARROW-15992:
--------------------------------------
Summary: [R] csv file encoding working for one file, but not a
folder of files
Key: ARROW-15992
URL: https://issues.apache.org/jira/browse/ARROW-15992
Project: Apache Arrow
Issue Type: Bug
Components: R
Reporter: Gregoire Leleu
Attachments: Test1.txt
The encoding option is applied when a single file is read with
read_delim_arrow, but not when a folder is opened with open_dataset.
read_delim_arrow creates a reader using CsvTableReader$create (which is what is
tested in the package's tests).
open_dataset creates a factory and I'm unable to follow what happens when
$Finish() is called.
Also, the documentation ("CsvReadOptions" page) lists the "encoding" option
under "CsvConvertOptions$create()" instead of "CsvReadOptions$create()".
{code:r}
library(dplyr)
library(arrow)
# Opens one file just fine:
one_file <- arrow::read_delim_arrow(
"test/Test1.txt",
as_data_frame = FALSE,
delim = ";",
read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
)
collect(one_file)
# Can't open the folder that has "Test1.txt" properly: Column2 ends up
# typed as binary
one_folder <- arrow::open_dataset(
"test",
delim = ";",
read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
)
collect(one_folder)
# Even when specifying the schema
one_folder_w_schema <- arrow::open_dataset(
"test",
schema = Schema$create(Column1 = string(), Column2 = string()),
format = FileFormat$create("text", skip_rows = 1L, delimiter = ";",
column_names = c("Column1", "Column2"),
read_options = CsvReadOptions$create(encoding = "ISO-8859-1"))
)
collect(one_folder_w_schema)
{code}
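As a possible workaround while open_dataset does not apply the encoding, the files could be re-encoded to UTF-8 before the folder is opened. The sketch below is not part of arrow: reencode_dir_to_utf8 is a hypothetical helper built only on base R (readLines, iconv, writeLines), and it assumes every file in the folder is a text file in the same source encoding.

```r
# Hypothetical helper (not an arrow function): copy every file in `src_dir`
# into `dst_dir`, converting its contents from `from` to UTF-8.
# Assumption: all files in `src_dir` are text files encoded in `from`.
reencode_dir_to_utf8 <- function(src_dir, dst_dir, from = "ISO-8859-1") {
  dir.create(dst_dir, showWarnings = FALSE, recursive = TRUE)
  files <- list.files(src_dir, full.names = TRUE)
  files <- files[!dir.exists(files)]  # skip sub-directories
  for (src in files) {
    # Read the lines as-is; readLines() does not re-encode by default.
    lines <- readLines(src, warn = FALSE)
    # iconv() returns NA for any element it cannot convert.
    utf8 <- iconv(lines, from = from, to = "UTF-8")
    if (anyNA(utf8)) stop("Could not convert ", src, " from ", from)
    # useBytes = TRUE writes the UTF-8 bytes without translating them again.
    writeLines(utf8, file.path(dst_dir, basename(src)), useBytes = TRUE)
  }
  invisible(dst_dir)
}

# Usage sketch (requires the arrow package; "test" is the folder from above):
# utf8_dir <- reencode_dir_to_utf8("test", file.path(tempdir(), "test_utf8"))
# one_folder <- arrow::open_dataset(utf8_dir, format = "text", delimiter = ";")
# dplyr::collect(one_folder)
```

This duplicates the data on disk, so it is only a stopgap; the original files are left untouched.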
--
This message was sent by Atlassian Jira
(v8.20.1#820001)