nealrichardson commented on code in PR #41223:
URL: https://github.com/apache/arrow/pull/41223#discussion_r1566547358
##########
r/R/dplyr-summarize.R:
##########
@@ -129,35 +129,71 @@ register_bindings_aggregate <- function() {
notes = "approximate median (t-digest) is computed"
)
register_binding_agg("dplyr::n_distinct", function(..., na.rm = FALSE) {
- list(
+ set_agg(
fun = "count_distinct",
data = ensure_one_arg(list2(...), "n_distinct"),
options = list(na.rm = na.rm)
)
})
register_binding_agg("dplyr::n", function() {
- list(
+ set_agg(
fun = "count_all",
data = list(),
options = list()
)
})
register_binding_agg("base::min", function(..., na.rm = FALSE) {
- list(
+ set_agg(
fun = "min",
data = ensure_one_arg(list2(...), "min"),
options = list(skip_nulls = na.rm, min_count = 0L)
)
})
register_binding_agg("base::max", function(..., na.rm = FALSE) {
- list(
+ set_agg(
fun = "max",
data = ensure_one_arg(list2(...), "max"),
options = list(skip_nulls = na.rm, min_count = 0L)
)
})
}
+set_agg <- function(...) {
+ agg_data <- list2(...)
+ # Find the environment where ..aggregations is stored
+ target <- find_aggregations_env()
+ aggs <- get("..aggregations", target)
+ lapply(agg_data[["data"]], function(expr) {
+ # If any of the fields referenced in the expression are in ..aggregations,
+ # then we can't aggregate over them.
+ # This is mainly for combinations of dataset columns and aggregations,
+ # like sum(x - mean(x)), i.e. window functions.
+ # This will reject (sum(sum(x)) as well, but that's not a useful operation.
+ if (any(expr$field_names_in_expression() %in% names(aggs))) {
+ # TODO: support in ARROW-13926
+ arrow_not_supported("aggregate within aggregate expression")
+ }
+ })
+
+ # Record the (fun, data, options) in ..aggregations
+ # and return a FieldRef pointing to it
+ tmpname <- paste0("..temp", length(aggs))
+ aggs[[tmpname]] <- agg_data
+ assign("..aggregations", aggs, envir = target)
+ Expression$field_ref(tmpname)
+}
+
+find_aggregations_env <- function() {
+ # Find the environment where ..aggregations is stored,
+ # it's in parent.env of something in the call stack
+ for (f in sys.frames()) {
+ if (exists("..aggregations", envir = f)) {
+ return(f)
+ }
+ }
+ stop("Could not find ..aggregations")
Review Comment:
Correct, yeah, seemed like logically this case needed to be caught but I
don't anticipate it being hit. If you're in `set_agg()`, somewhere up in the
list of parent/enclosing environments, you'll find the one from
`summarize_eval()` that contains this.
If you were to call the binding outside of `summarize()`, like this, you'd
hit it, but that's not a real use case:
```
> arrow:::agg_funcs[["sum"]](1)
Error in find_aggregations_env() : Could not find ..aggregations
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]