[ https://issues.apache.org/jira/browse/ARROW-15992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nicola Crane updated ARROW-15992:
---------------------------------
Parent: ARROW-18181
Issue Type: Sub-task (was: Bug)
> [R] csv file encoding working for one file, but not a folder of files
> ---------------------------------------------------------------------
>
> Key: ARROW-15992
> URL: https://issues.apache.org/jira/browse/ARROW-15992
> Project: Apache Arrow
> Issue Type: Sub-task
> Components: R
> Reporter: Gregoire Leleu
> Priority: Major
> Attachments: Test1.txt
>
>
> The encoding option is applied when a single file is read with
> read_delim_arrow, but not when a folder is opened with open_dataset.
> read_delim_arrow creates a reader using CsvTableReader$create (which is the
> code path covered by the package's tests).
> open_dataset creates a factory instead, and I'm unable to follow what happens
> when $Finish() is called.
>
> Also, the documentation ("CsvReadOptions" page) lists the "encoding" option
> under "CsvConvertOptions$create()" instead of "CsvReadOptions$create()".
>
> {code:r}
> library(dplyr)
> library(arrow)
> # Opens one file just fine:
> one_file <- arrow::read_delim_arrow(
>   "test/Test1.txt",
>   as_data_frame = FALSE,
>   delim = ";",
>   read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
> )
> collect(one_file)
>
> # Can't open the folder that has "Test1.txt" properly; Column2 ends up
> # being typed as binary
> one_folder <- arrow::open_dataset(
>   "test",
>   delim = ";",
>   read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
> )
> collect(one_folder)
>
> # Even when specifying the schema
> one_folder_w_schema <- arrow::open_dataset(
>   "test",
>   schema = Schema$create(Column1 = string(), Column2 = string()),
>   format = FileFormat$create(
>     "text",
>     skip_rows = 1L,
>     delimiter = ";",
>     column_names = c("Column1", "Column2"),
>     read_options = CsvReadOptions$create(encoding = "ISO-8859-1")
>   )
> )
> collect(one_folder_w_schema)
> {code}
>
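A possible workaround while open_dataset does not apply the encoding option is to re-encode the files to UTF-8 before opening the folder. The sketch below is not taken from the report. reencode_dir is a hypothetical helper that uses only base R (readLines, iconv, writeLines), and it assumes that every file in the folder is line-oriented text in the same source encoding.

```r
# Hypothetical helper (not part of arrow): copy every file in `src_dir` into
# `dest_dir`, converting its text from the `from` encoding to UTF-8.
reencode_dir <- function(src_dir, dest_dir, from = "ISO-8859-1") {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)

  # List the entries of the folder and skip any sub-directories.
  files <- list.files(src_dir, full.names = TRUE)
  files <- files[!dir.exists(files)]

  for (src in files) {
    # Read the lines as they are stored on disk.
    lines <- readLines(src, warn = FALSE)
    # Convert each line from the source encoding to UTF-8.
    utf8_lines <- iconv(lines, from = from, to = "UTF-8")
    # useBytes = TRUE writes the UTF-8 bytes without translating them again.
    writeLines(utf8_lines, file.path(dest_dir, basename(src)), useBytes = TRUE)
  }

  invisible(dest_dir)
}
```

After a call such as reencode_dir("test", "test_utf8"), the converted folder could be opened with arrow::open_dataset("test_utf8", format = "text", delimiter = ";") without any encoding option. This has not been checked against the attached Test1.txt.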
--
This message was sent by Atlassian Jira
(v8.20.10#820010)