[GitHub] [arrow] jonkeane commented on a change in pull request #11894: ARROW-14029: [R] Repair map_batches()

GitBox Thu, 06 Jan 2022 15:12:25 -0800


jonkeane commented on a change in pull request #11894:
URL: https://github.com/apache/arrow/pull/11894#discussion_r779925693




##########
File path: r/R/dataset-scan.R
##########
@@ -185,17 +185,36 @@ ScanTask <- R6Class("ScanTask",
 #' `data.frame`? Default `TRUE`
 #' @export
 map_batches <- function(X, FUN, ..., .data.frame = TRUE) {
-  if (.data.frame) {
-    lapply <- map_dfr
-  }
-  scanner <- Scanner$create(ensure_group_vars(X))
+  # TODO(ARROW-15271): possibly refactor do_exec_plan to return a RecordBatchReader

Review comment:
       ```suggestion
     # TODO: ARROW-15271 possibly refactor do_exec_plan to return a RecordBatchReader
   ```
   
   Super minor, but most of our TODOs have the colon and then the issue number (though there are others that do it with a space too)

##########
File path: r/tests/testthat/test-dataset.R
##########
@@ -453,15 +453,38 @@ test_that("Creating UnionDataset", {
 })
 
 test_that("map_batches", {
-  skip("map_batches() is broken (ARROW-14029)")
   ds <- open_dataset(dataset_dir, partitioning = "part")
+
+  # summarize returns arrow_dplyr_query, which gets collected into a tibble
   expect_equal(
     ds %>%
       filter(int > 5) %>%
       select(int, lgl) %>%
-      map_batches(~ summarize(., min_int = min(int))),
+      map_batches(~ summarize(., min_int = min(int))) %>%
+      arrange(min_int),
     tibble(min_int = c(6L, 101L))
   )
+
+  # $num_rows returns integer vector
+  expect_equal(
+    ds %>%
+      filter(int > 5) %>%
+      select(int, lgl) %>%
+      map_batches(~ .$num_rows, .data.frame = FALSE) %>%
+      unlist() %>% # Returns list because .data.frame is FALSE

Review comment:
       Thanks for this fantastic _extra_ clarification!

##########
File path: r/vignettes/dataset.Rmd
##########
@@ -290,6 +290,64 @@ rows match the filter. Relatedly, since Parquet files contain row groups with
 statistics on the data within, there may be entire chunks of data you can
 avoid scanning because they have no rows where `total_amount > 100`.
 
+### Processing data in batches

Review comment:
       I like it a lot. And I think it totally belongs here in a vignette 
(especially in the tone you have here). But it wouldn't be bad to make an issue 
to add to the cookbook as well (though don't feel obligated to do that right 
now if you don't want to!). 



