westonpace commented on pull request #10118:
URL: https://github.com/apache/arrow/pull/10118#issuecomment-841861097
This PR could use some advice from the R community. I'm adding the ability
to request async scanning. At the moment async is a performance degradation
in some cases when I/O is really fast, so until we've made more progress
there it will need to be optional. I've added `UseAsync` to the scanner in R,
which is used, for example, like this:
```
test_that("Scanner$ScanBatches", {
  ds <- open_dataset(ipc_dir, format = "feather")
  batches <- ds$NewScan()$Finish()$ScanBatches()
  table <- Table$create(!!!batches)
  expect_equivalent(as.data.frame(table), rbind(df1, df2))
  batches <- ds$NewScan()$UseAsync(TRUE)$Finish()$ScanBatches()
  table <- Table$create(!!!batches)
  expect_equivalent(as.data.frame(table), rbind(df1, df2))
})
```
However, most of the examples I see reading a dataset are doing something
like...
```
ds %>%
  select(string = chr, integer = int) %>%
  filter(integer > 6 & integer < 11) %>%
  collect() %>%
  summarize(mean = mean(integer))
```
How should `UseAsync` be inserted into such a chain of calls?
Should it be its own operator:
```
ds %>%
  select(string = chr, integer = int) %>%
  filter(integer > 6 & integer < 11) %>%
  use_async() %>%
  collect() %>%
  summarize(mean = mean(integer))
```
...or an argument to `collect`:
```
ds %>%
  select(string = chr, integer = int) %>%
  filter(integer > 6 & integer < 11) %>%
  collect(use_async = TRUE) %>%
  summarize(mean = mean(integer))
```
...or exposed in some other way?
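To make the two options concrete, here is a rough base-R sketch of how each could carry the flag down to the scanner. Everything in it is a hypothetical stand-in: `new_query`, `collect_toy`, and the `scanned_async` attribute are invented for illustration and are not part of the arrow package, and the real query object and `collect()` method would look different.

```r
# Toy stand-in for a lazy dataset query (hypothetical, NOT arrow's real class).
new_query <- function(data) {
  structure(list(data = data, use_async = FALSE), class = "toy_query")
}

# Option 1: a standalone verb that sets a flag on the query and returns it,
# so it can sit anywhere in the chain before collect().
use_async <- function(query, value = TRUE) {
  query$use_async <- value
  query
}

# Option 2: collect takes the flag directly. NULL means "keep whatever the
# query already carries", so both options could coexist.
collect_toy <- function(query, use_async = NULL) {
  if (!is.null(use_async)) query$use_async <- use_async
  # A real implementation would hand query$use_async to the scanner here,
  # e.g. through the UseAsync() builder method. The toy just records it.
  result <- query$data
  attr(result, "scanned_async") <- query$use_async
  result
}

q <- new_query(data.frame(int = 1:3))
res_verb    <- collect_toy(use_async(q))          # option 1
res_arg     <- collect_toy(q, use_async = TRUE)   # option 2
res_default <- collect_toy(q)                     # async stays off
```

In this sketch the standalone verb keeps the setting on the query object, while the argument form keeps it local to the call that actually triggers the scan.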