[GitHub] [arrow] nealrichardson commented on a change in pull request #11373: ARROW-13893: [R] Make head/tail lazy on datasets and queries

GitBox Fri, 08 Oct 2021 12:19:10 -0700


nealrichardson commented on a change in pull request #11373:
URL: https://github.com/apache/arrow/pull/11373#discussion_r725247172




##########
File path: r/R/query-engine.R
##########
@@ -175,9 +183,20 @@ ExecPlan <- R6Class("ExecPlan",
           temp_columns = names(.data$temp_columns)
         )
       }
+
+      # This is only safe because we are going to evaluate queries that end
+      # with head/tail first, then evaluate any subsequent query as a new query
+      if (!is.null(.data$head)) {
+        node$head <- .data$head
+      }
+      if (!is.null(.data$tail)) {
+        node$tail <- .data$tail
+      }
+
       node
     },
     Run = function(node) {
+      # TODO: pass head/tail to ExecPlan_run

Review comment:
       ```suggestion
         # TODO (ARROW-12763): pass head/tail to ExecPlan_run
   ```

##########
File path: r/R/dplyr-collect.R
##########
@@ -19,6 +19,20 @@
 # The following S3 methods are registered on load if dplyr is present
 
 collect.arrow_dplyr_query <- function(x, as_data_frame = TRUE, ...) {
+  # head and tail are not ExecNodes, at best we can handle them via sink node
+  # so if there are any steps done after head/tail, we need to
+  # evaluate the query up to then and then do a new query for the rest
+  if (is_collapsed(x) && has_head_tail(x$.data)) {
+    x$.data <- as_adq(dplyr::compute(x$.data))
+  }
+  # TODO: if there are no aggregations or joins or sorting, (and there is

Review comment:
       I will do this in this PR (or maybe not necessary if I can use the 
Scanner interface in do_exec_plan)

##########
File path: r/R/dplyr.R
##########
@@ -162,22 +162,36 @@ as.data.frame.arrow_dplyr_query <- function(x, row.names 
= NULL, optional = FALS
 
 #' @export
 head.arrow_dplyr_query <- function(x, n = 6L, ...) {
-  # TODO (ARROW-13893): refactor
-  out <- head.Dataset(x, n, ...)
-  restore_dplyr_features(out, x)
+  # TODO: test that nrow on collapsed query with agg/join/head

Review comment:
       Will do now

##########
File path: r/R/query-engine.R
##########
@@ -20,6 +20,14 @@ do_exec_plan <- function(.data) {
   final_node <- plan$Build(.data)
   tab <- plan$Run(final_node)$read_table()
 
+  # TODO: can we move all of this logic into plan$Run, making a "source" node?
+  if (!is.null(final_node$head)) {
+    # TODO: implement head off of the RecordBatchReader so you can bail early

Review comment:
       I will do this one here, will try going through 
ScannerBuilder::FromRecordBatchReader and use the existing head/tail methods 
there so as not to reimplement that logic. Assuming that works, I may be able 
to pull this into plan$Run (and do Scanner::ToRecordBatchReader on the other 
side)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] nealrichardson commented on a change in pull request #11373: ARROW-13893: [R] Make head/tail lazy on datasets and queries

Reply via email to