[GitHub] [arrow] westonpace commented on pull request #10118: ARROW-12468: Expose ScannerBuilder::UseAsync to python & R

GitBox Sun, 16 May 2021 11:57:34 -0700


westonpace commented on pull request #10118:
URL: https://github.com/apache/arrow/pull/10118#issuecomment-841861097



   This PR could use some advice from the R community.  I'm adding the ability 
to request async scanning.  At the moment async is a performance degradation in 
some cases when I/O is really fast, so until we've made more progress there it 
will need to be optional.  I've added `UseAsync` to the scanner in R, which is 
used, for example, like this:
   
   ```
   test_that("Scanner$ScanBatches", {
     ds <- open_dataset(ipc_dir, format = "feather")
     batches <- ds$NewScan()$Finish()$ScanBatches()
     table <- Table$create(!!!batches)
     expect_equivalent(as.data.frame(table), rbind(df1, df2))
   
     batches <- ds$NewScan()$UseAsync(TRUE)$Finish()$ScanBatches()
     table <- Table$create(!!!batches)
     expect_equivalent(as.data.frame(table), rbind(df1, df2))
   })
   ```
   However, most of the examples I see reading a dataset are doing something 
like...
   
   ```
   ds %>%
         select(string = chr, integer = int) %>%
         filter(integer > 6 & integer < 11) %>%
         collect() %>%
         summarize(mean = mean(integer))
   ```
   
   How should `UseAsync` be inserted into such a chain of calls?  
Should it be its own operator:
   
   ```
   ds %>%
         select(string = chr, integer = int) %>%
         filter(integer > 6 & integer < 11) %>%
         use_async() %>%
         collect() %>%
         summarize(mean = mean(integer))
   ```
   
   ...or an argument to `collect`:
   
   ```
   ds %>%
         select(string = chr, integer = int) %>%
         filter(integer > 6 & integer < 11) %>%
         collect(use_async=TRUE) %>%
         summarize(mean = mean(integer))
   ```
   ...or exposed in some other way?
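   
   To make the two options concrete, here is a rough toy sketch in base R. 
It does not use the arrow package at all: `new_query`, `collect_query`, and 
this `use_async` are hypothetical names invented for illustration, and the 
`scanned_async` attribute only stands in for whatever the scanner would 
really do with the flag.
   
   ```r
   # Toy stand-in for a lazy dataset query; NOT arrow's real classes.
   new_query <- function(data) {
     structure(list(data = data, use_async = FALSE), class = "toy_query")
   }
   
   # Option 1: a standalone verb that records the setting on the query object.
   use_async <- function(query, enabled = TRUE) {
     query$use_async <- enabled
     query
   }
   
   # Option 2: an argument on the terminal step.
   # NULL means "keep whatever the query object already says".
   collect_query <- function(query, use_async = NULL) {
     if (!is.null(use_async)) {
       query$use_async <- use_async
     }
     # A real implementation would hand the flag to the scanner builder here;
     # this sketch just tags the result so the choice is observable.
     result <- query$data
     attr(result, "scanned_async") <- query$use_async
     result
   }
   
   df <- data.frame(int = 1:10)
   via_verb <- collect_query(use_async(new_query(df)))
   via_arg <- collect_query(new_query(df), use_async = TRUE)
   ```
   
   In this sketch both spellings end up setting the same flag, so the choice 
is mostly about which reads better in a pipeline: the verb keeps the setting 
on the query object, while the argument keeps it next to the step that 
actually triggers the scan.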


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
