[ 
https://issues.apache.org/jira/browse/ARROW-12059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359591#comment-17359591
 ] 

Jonathan Keane commented on ARROW-12059:
----------------------------------------

While working on an independent task, I ran into this (and followed the issues 
to make sure we've got it covered). 

I'm not totally sure that `collect()` is the most natural place to put this 
from an R-user's perspective. 

The code I first tried was:
{code}
ds <- open_dataset("cranlogs", partitioning = c("year", "month", "day"),
                   format = "csv", na = c("", "NA"))

# only ~17% of cran queries include version
since_41 <- ds %>%
  filter(date > as.Date("2021-05-18")) %>%
  filter(r_version != "NA") %>%
  select(date, r_version, r_os, package) %>%
  collect()
{code}

This is a pretty common (and simple) use case. Other readers that support 
reading multiple files, like 
[{vroom}|https://www.tidyverse.org/blog/2019/05/vroom-1-0-0/#reading-multiple-files],
 take this option at the read/open step:

{code}
library(vroom)

# vroom() is the reader exported by {vroom}; it accepts a vector of file paths
table <- vroom(list_of_files, na = c("", "NA"))
{code}

I don't think this has to be one or the other. I suspect we could support 
specifying it in both places, but we should implement it at the 
{{open_dataset()}} step if at all possible, to match other paradigms.

> [R] Accept format-specific scan options in collect()
> ----------------------------------------------------
>
>                 Key: ARROW-12059
>                 URL: https://issues.apache.org/jira/browse/ARROW-12059
>             Project: Apache Arrow
>          Issue Type: Task
>          Components: R
>    Affects Versions: 4.0.0
>            Reporter: David Li
>            Priority: Major
>              Labels: dataset, datasets
>             Fix For: 5.0.0
>
>
> ARROW-9749 and ARROW-8631 added format/scan-specific options. In R, the most 
> natural place to accept these is in collect(), but this isn't yet done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
