[GitHub] [arrow] ianmcook commented on a change in pull request #10992: ARROW-13740: [R] summarize() should not eagerly evaluate

GitBox Wed, 25 Aug 2021 07:34:34 -0700


ianmcook commented on a change in pull request #10992:
URL: https://github.com/apache/arrow/pull/10992#discussion_r695816039




##########
File path: r/R/query-engine.R
##########
@@ -42,11 +55,73 @@ ExecPlan <- R6Class("ExecPlan",
       }
       # ScanNode needs the filter to do predicate pushdown and skip partitions,
       # and it needs to know which fields to materialize (and which are 
unnecessary)
-      ExecNode_Scan(self, dataset, filter, colnames)
+      ExecNode_Scan(self, dataset, filter, colnames %||% character(0))
+    },
+    Build = function(.data) {
+      # This method takes an arrow_dplyr_query and chains together the
+      # ExecNodes that they produce. It does not evaluate them--that is Run().
+      group_vars <- dplyr::group_vars(.data)
+      grouped <- length(group_vars) > 0
+
+      # Collect the target names first because we have to add back the group 
vars
+      target_names <- names(.data)
+      .data <- ensure_group_vars(.data)
+      .data <- ensure_arrange_vars(.data) # this sets .data$temp_columns
+
+      node <- self$Scan(.data)
+      # ARROW-13498: Even though Scan takes the filter, apparently we have to 
do it again
+      if (inherits(.data$filtered_rows, "Expression")) {
+        node <- node$Filter(.data$filtered_rows)
+      }
+      # If any columns are derived we need to Project (otherwise this may be 
no-op)
+      node <- node$Project(c(.data$selected_columns, .data$temp_columns))
+
+      if (length(.data$aggregations)) {
+        if (grouped) {
+          # We need to prefix all of the aggregation function names with 
"hash_"
+          .data$aggregations <- lapply(.data$aggregations, function(x) {
+            x[["fun"]] <- paste0("hash_", x[["fun"]])
+            x
+          })
+        }
+
+        node <- node$Aggregate(
+          options = .data$aggregations,
+          target_names = target_names,
+          out_field_names = names(.data$aggregations),
+          key_names = group_vars
+        )
+
+        if (grouped) {
+          # The result will have result columns first then the grouping cols.
+          # dplyr orders group cols first, so adapt the result to meet that 
expectation.
+          node <- node$Project(
+            make_field_refs(c(group_vars, names(.data$aggregations)))
+          )
+          # Add sorting instructions for the rows too to match dplyr
+          # (see below about why sorting isn't itself a Node)
+          node$sort <- list(
+            names = group_vars,
+            orders = rep(0L, length(group_vars))
+          )
+        }
+      }
+
+      # Apply sorting: this is currently not an ExecNode itself, it is a
+      # sink node option.
+      # TODO: error if doing a subsequent operation that would throw away 
sorting!

Review comment:
       I think a warning would be more appropriate than an error




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] ianmcook commented on a change in pull request #10992: ARROW-13740: [R] summarize() should not eagerly evaluate

Reply via email to