[GitHub] [arrow] nealrichardson commented on a diff in pull request #33917: GH-33892: [R] Map `dplyr::n()` to `count_all` kernel

via GitHub Thu, 09 Feb 2023 08:13:44 -0800


nealrichardson commented on code in PR #33917:
URL: https://github.com/apache/arrow/pull/33917#discussion_r1101700891



##########
r/R/dplyr-summarize.R:
##########
@@ -322,15 +301,71 @@ arrow_eval_or_stop <- function(expr, mask) {
   out
 }
 
+# This function returns a list of expressions which is used to project the data
+# before an aggregation. This list includes the fields used in the aggregation
+# expressions (the "targets") and the group fields. The names of the returned
+# list are used to ensure that the projection node is wired up correctly to the
+# aggregation node.
 summarize_projection <- function(.data) {
   c(
-    map(.data$aggregations, ~ .$data),
+    unlist(unname(imap(
+      .data$aggregations,
+      ~set_names(
+        .x$data,
+        aggregate_target_names(.x$data, .y)
+      )
+    ))),
     .data$selected_columns[.data$group_by_vars]
   )
 }
 
+# This function determines what names to give to the fields used in an
+# aggregation expression (the "targets"). When an aggregate function takes 2 or
+# more fields as targets, this function gives the fields unique names by
+# appending `..1`, `..2`, etc. When an aggregate function is nullary, this
+# function returns a zero-length character vector.
+aggregate_target_names <- function(data, name) {
+  if (length(data) > 1) {
+    paste(name, seq_along(data), sep = "..")
+  } else if (length(data) > 0) {
+    name
+  } else {
+    character(0)
+  }
+}
+
+# This function returns a named list of the data types of the aggregate columns
+# returned by an aggregation
+aggregate_types <- function(.data, hash, schema = NULL) {

Review Comment:
   Is schema needed as an argument here? The Expressions in $data will all 
bring schema with them. 



##########
r/R/dplyr-collect.R:
##########
@@ -179,18 +179,18 @@ implicit_schema <- function(.data) {
       new_fields <- c(left_fields, right_fields)
     }
   } else {
+    hash <- length(.data$group_by_vars) > 0
     # The output schema is based on the aggregations and any group_by vars
-    new_fields <- map(summarize_projection(.data), ~ .$type(old_schm))
+    new_fields <- c(
+      aggregate_types(.data, hash, old_schm),
+      group_types(.data, old_schm)
+    )

Review Comment:
   If you do this, can you delete L192-194?
   
   ```suggestion
       new_fields <- c(
         group_types(.data, old_schm)
         aggregate_types(.data, hash, old_schm)
       )
   ```



##########
r/R/dplyr-summarize.R:
##########
@@ -322,15 +301,71 @@ arrow_eval_or_stop <- function(expr, mask) {
   out
 }
 
+# This function returns a list of expressions which is used to project the data
+# before an aggregation. This list includes the fields used in the aggregation
+# expressions (the "targets") and the group fields. The names of the returned
+# list are used to ensure that the projection node is wired up correctly to the
+# aggregation node.
 summarize_projection <- function(.data) {
   c(
-    map(.data$aggregations, ~ .$data),
+    unlist(unname(imap(
+      .data$aggregations,
+      ~set_names(
+        .x$data,
+        aggregate_target_names(.x$data, .y)
+      )
+    ))),
     .data$selected_columns[.data$group_by_vars]
   )
 }
 
+# This function determines what names to give to the fields used in an
+# aggregation expression (the "targets"). When an aggregate function takes 2 or
+# more fields as targets, this function gives the fields unique names by
+# appending `..1`, `..2`, etc. When an aggregate function is nullary, this
+# function returns a zero-length character vector.
+aggregate_target_names <- function(data, name) {
+  if (length(data) > 1) {
+    paste(name, seq_along(data), sep = "..")
+  } else if (length(data) > 0) {
+    name
+  } else {
+    character(0)
+  }
+}
+
+# This function returns a named list of the data types of the aggregate columns
+# returned by an aggregation
+aggregate_types <- function(.data, hash, schema = NULL) {
+  map(
+    .data$aggregations,
+    ~if (hash) {
+      Expression$create(
+        paste0("hash_", .$fun),
+        # hash aggregate kernels must be passed another argument representing
+        # the groups, so we pass in a dummy scalar, since the groups will not
+        # affect the type that an aggregation returns
+        args = c(.$data, Scalar$create(1L, uint32())),

Review Comment:
   Minor optimization: we could define `.hash_arg <- Scalar$create(1L, 
uint32())` outside the function so we don't have to keep doing it. There's 
small (<1ms) overhead in creating R6 objects, and this creates 2 every time you 
call it. When I was doing some benchmarking of the query building a while back, 
these all added up such that our conbench TPC-H query benchmarks were being 
driven more by them than evaluating the query itself (since they're not using 
really big data). 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] nealrichardson commented on a diff in pull request #33917: GH-33892: [R] Map `dplyr::n()` to `count_all` kernel

Reply via email to