[
https://issues.apache.org/jira/browse/ARROW-12059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359591#comment-17359591
]
Jonathan Keane commented on ARROW-12059:
----------------------------------------
Working on an independent task I ran into this (and followed the issues to make
sure we've got it covered).
I'm not totally sure that `collect()` is the most natural place to put this
from an R-user's perspective.
The code I first tried was:
{code}
library(arrow)
library(dplyr)

ds <- open_dataset(
  "cranlogs",
  partitioning = c("year", "month", "day"),
  format = "csv",
  na = c("", "NA")
)
# only ~17% of cran queries include version
since_41 <- ds %>%
  filter(date > as.Date("2021-05-18")) %>%
  filter(r_version != "NA") %>%
  select(date, r_version, r_os, package) %>%
  collect()
{code}
This is a pretty common (and simple) version of this workflow. Other readers [like
{vroom}|https://www.tidyverse.org/blog/2019/05/vroom-1-0-0/#reading-multiple-files]
that support reading multiple files do this at the read/open step:
{code}
library(vroom)
table <- vroom(list_of_files, na = c("", "NA"))
{code}
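For a self-contained illustration of that read-step behavior, here is a minimal
sketch (the file names and contents are made up for the example, and it assumes
the vroom package is installed):
{code}
library(vroom)

# Hypothetical sample data: two small CSV files in a temporary directory,
# where missing R versions appear either as "NA" or as an empty field.
tmp <- tempfile("cranlogs-example")
dir.create(tmp)
files <- file.path(tmp, c("day1.csv", "day2.csv"))
writeLines(c("package,r_version", "arrow,4.1.0", "dplyr,NA"), files[1])
writeLines(c("package,r_version", "vroom,", "readr,4.0.5"), files[2])

# Both files are read in one call; `na` is applied as the data is read,
# so "" and "NA" arrive as real missing values.
logs <- vroom(files, na = c("", "NA"))

# Missing versions can then be dropped with is.na() rather than by
# comparing against the string "NA".
with_version <- logs[!is.na(logs$r_version), ]
{code}
With the NA strings handled at read time, the later filter does not need to
know how missing values were spelled in the files.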
I don't think this has to be one or the other. I suspect we could support
specifying it in both places, but we should implement it at the
{{open_dataset()}} step if at all possible, to match other paradigms.
> [R] Accept format-specific scan options in collect()
> ----------------------------------------------------
>
> Key: ARROW-12059
> URL: https://issues.apache.org/jira/browse/ARROW-12059
> Project: Apache Arrow
> Issue Type: Task
> Components: R
> Affects Versions: 4.0.0
> Reporter: David Li
> Priority: Major
> Labels: dataset, datasets
> Fix For: 5.0.0
>
>
> ARROW-9749 and ARROW-8631 added format/scan-specific options. In R, the most
> natural place to accept these is in collect(), but this isn't yet done.