ianmcook commented on code in PR #33917:
URL: https://github.com/apache/arrow/pull/33917#discussion_r1095222011
##########
r/R/dplyr-summarize.R:
##########
@@ -322,15 +301,76 @@ arrow_eval_or_stop <- function(expr, mask) {
out
}
+# This function returns a list of expressions that can be used to project the
+# data before an aggregation to only the fields required for the aggregation,
+# including the fields used in the aggregations (the "targets") and the group
+# fields. The names of the returned list are used to ensure that the projection
+# node is wired up correctly to the aggregation node.
summarize_projection <- function(.data) {
  c(
-    map(.data$aggregations, ~ .$data),
+    unlist(unname(imap(
+      .data$aggregations,
+      ~set_names(
+        .x$data,
+        aggregate_target_names(.x$data, .y)
+      )
+    ))),
+    .data$selected_columns[.data$group_by_vars]
+  )
+}
+
+# This function determines what names to give to the fields used in aggregations
+# (the "targets"). When an aggregate function takes 2 or more fields as targets,
+# this function gives the fields unique names by appending `..1`, `..2`, etc.
+# When an aggregate function is nullary, this function returns a zero-length
+# character vector.
+aggregate_target_names <- function(data, name) {
+  if (length(data) > 1) {
+    paste(name, seq_along(data), sep = "..")
+  } else if (length(data) > 0) {
+    name
+  } else {
+    character(0)
+  }
+}
+
+# This function returns a list of expressions representing the aggregated fields
+# that will be returned by an aggregation
+aggregated_fields <- function(aggs) {
+  map(
+    aggs,
+    ~Expression$create(.$fun, args = .$data, options = .$options)
+  )
+}
+
+# Unlike with other pairs of non-hash/hash aggregate kernels in the Arrow C++
+# library, the `tdigest` and `hash_tdigest` kernels have different output types.
Review Comment:
Very cool!
I changed it to work like this, but I used `Scalar$create(1L, uint32())`
instead of `Expression$scalar(1L)$cast(uint32())` as the groups argument.
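
For readers following the naming logic in the hunk above, here is a minimal base-R sketch of how `aggregate_target_names()` behaves and how the resulting names end up on the projection. The function body is copied from the diff. Everything else is an assumption made for illustration: the `aggs` list, the output names `r`, `total`, and `n`, and the plain strings standing in for the `Expression` objects held in `$data`. Base R `Map()` and `setNames()` are swapped in for `imap()` and `set_names()` so the sketch runs without purrr or rlang.

```r
# Copied from the diff above.
aggregate_target_names <- function(data, name) {
  if (length(data) > 1) {
    paste(name, seq_along(data), sep = "..")
  } else if (length(data) > 0) {
    name
  } else {
    character(0)
  }
}

# Hypothetical aggregations: plain strings stand in for Expression objects.
aggs <- list(
  r     = list(data = list("x", "y")),  # binary aggregate: two targets
  total = list(data = list("x")),       # unary aggregate: one target
  n     = list(data = list())           # nullary aggregate: no targets
)

names_two  <- aggregate_target_names(aggs$r$data, "r")
names_one  <- aggregate_target_names(aggs$total$data, "total")
names_none <- aggregate_target_names(aggs$n$data, "n")

# Base-R equivalent of the imap()/set_names() step in summarize_projection():
# name each target, drop the outer names, then flatten.
projection <- unlist(unname(Map(
  function(agg, name) {
    setNames(agg$data, aggregate_target_names(agg$data, name))
  },
  aggs,
  names(aggs)
)))

print(names_two)
print(names(projection))
```

With these stand-ins the binary aggregate contributes two uniquely named fields, the unary aggregate keeps its own name, and the nullary aggregate contributes nothing to the projection.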