nealrichardson commented on code in PR #41223:
URL: https://github.com/apache/arrow/pull/41223#discussion_r1566560742
##########
r/R/dplyr-summarize.R:
##########
@@ -221,25 +257,27 @@ do_arrow_summarize <- function(.data, ..., .groups =
NULL) {
# It's more complex than other places because a single summarize() expr
# may result in multiple query nodes (Aggregate, Project),
# and we have to walk through the expressions to disentangle them.
- ctx <- env(
- mask = arrow_mask(.data, aggregation = TRUE),
- aggregations = empty_named_list(),
- post_mutate = empty_named_list()
- )
+
+ # Agg functions pull out the aggregation info and append it here
+ ..aggregations <- empty_named_list()
+ # And if there are any transformations after the aggregation, they go here
+ ..post_mutate <- empty_named_list()
+ mask <- arrow_mask(.data, aggregation = TRUE)
+
for (i in seq_along(exprs)) {
# Iterate over the indices and not the names because names may be repeated
# (which overwrites the previous name)
summarize_eval(
names(exprs)[i],
exprs[[i]],
- ctx,
+ mask,
length(.data$group_by_vars) > 0
)
}
# Apply the results to the .data object.
# First, the aggregations
- .data$aggregations <- ctx$aggregations
+ .data$aggregations <- ..aggregations
Review Comment:
I'll explain here, and then I'll add more in comments.
`summarize()` is complicated because you can do a mixture of scalar
operations and aggregations, but that's not how acero works. So we have to pull
out the aggregations, collect them in one list (that will become an Aggregate
ExecNode), and in the expressions, replace them with FieldRefs so that further
operations can happen (in what will become a ProjectNode that works on the
result of the Aggregate).
In "normal" arrow_eval, like in `mutate()`, each expression/quosure results
in a single Arrow Expression. But in `summarize()`, it could generate one or
more aggregations that go into Aggregate, and then an Expression after that.
Example from the comments in `do_arrow_summarize()`:
```
# For example,
# summarize(mean = sum(x) / n())
# is effectively implemented as
# summarize(..temp0 = sum(x), ..temp1 = n()) %>%
# mutate(mean = ..temp0 / ..temp1) %>%
# select(-starts_with("..temp"))
```
So, each aggregation binding needs to push a `..tempN` aggregation onto some
list somewhere and return a FieldRef so that any projections happening after
evaluate "normally".
`ctx` was useful previously because we weren't calling `arrow_eval()` on the
expressions, we were walking them, looking for known aggregation functions,
pulling them out and inserting into the expression the `..tempN` symbols, and
then once it was all scalar functions left, calling `arrow_eval()`. But that
doesn't allow for defining a `weighted.mean` binding as `function(x, w) sum(x *
w) / sum(w)` because you can't just substitute into the call to
`weighted.mean(x, w)`, you really just want to evaluate it and let the usual
Arrow Expression logic work.
The challenge was: where is that "somewhere" to collect the aggregations? It
needs to be somewhere where the binding functions can find it, we can't pass it
in as an argument everywhere.
Generally, R looks up symbols in each parent frame of where the function is
defined, like this:
```
> x <- 1
> f <- function() x
> f()
[1] 1
```
So you'd think that you could just have `..aggregations` in this function,
and when you evaluate the expressions, they would find them in the enclosing
environment. But no. It's like this:
```
> g <- function() {
+ x <- 2
+ f()
+ }
> g()
[1] 1
```
`f()` has its own environment and looks up symbols from its parents.
So, I put `..aggregations` in this environment, and then set *this*
environment as the parent for each of the aggregation bindings in
`arrow_mask()`. That way, they can find it and assign into it. This seemed
better than the alternatives I almost gave up and fell back to, like something
global/in the package namespace.
After doing that, I realized that I could do a similar thing with user
functions: copy them into the mask and set their environment to be the mask, so
they'd find the other bindings there.
Does that make sense?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]