Re: [PR] [R] Redo how summarize() evaluates expressions [arrow]

via GitHub Mon, 15 Apr 2024 17:04:04 -0700


nealrichardson commented on code in PR #41223:
URL: https://github.com/apache/arrow/pull/41223#discussion_r1566560742



##########
r/R/dplyr-summarize.R:
##########
@@ -221,25 +257,27 @@ do_arrow_summarize <- function(.data, ..., .groups = 
NULL) {
   # It's more complex than other places because a single summarize() expr
   # may result in multiple query nodes (Aggregate, Project),
   # and we have to walk through the expressions to disentangle them.
-  ctx <- env(
-    mask = arrow_mask(.data, aggregation = TRUE),
-    aggregations = empty_named_list(),
-    post_mutate = empty_named_list()
-  )
+
+  # Agg functions pull out the aggregation info and append it here
+  ..aggregations <- empty_named_list()
+  # And if there are any transformations after the aggregation, they go here
+  ..post_mutate <- empty_named_list()
+  mask <- arrow_mask(.data, aggregation = TRUE)
+
   for (i in seq_along(exprs)) {
     # Iterate over the indices and not the names because names may be repeated
     # (which overwrites the previous name)
     summarize_eval(
       names(exprs)[i],
       exprs[[i]],
-      ctx,
+      mask,
       length(.data$group_by_vars) > 0
     )
   }
 
   # Apply the results to the .data object.
   # First, the aggregations
-  .data$aggregations <- ctx$aggregations
+  .data$aggregations <- ..aggregations

Review Comment:
   I'll explain here, and then I'll add more in comments. 
   
   `summarize()` is complicated because you can do a mixture of scalar 
operations and aggregations, but that's not how acero works. So we have to pull 
out the aggregations, collect them in one list (that will become an Aggregate 
ExecNode), and in the expressions, replace them with FieldRefs so that further 
operations can happen (in what will become a ProjectNode that works on the 
result of the Aggregate).
   
   In "normal" arrow_eval, like in `mutate()`, each expression/quosure results 
in a single Arrow Expression. But in `summarize()`, it could generate one or 
more aggregations that go into Aggregate, and then an Expression after that. 
Example from the comments in `do_arrow_summarize()`:
   
   ```
     # For example,
     #   summarize(mean = sum(x) / n())
     # is effectively implemented as
     #   summarize(..temp0 = sum(x), ..temp1 = n()) %>%
     #   mutate(mean = ..temp0 / ..temp1) %>%
     #   select(-starts_with("..temp"))
   ```
   
   So, each aggregation binding needs to push a `..tempN` aggregation onto some 
list somewhere and return a FieldRef so that any projections happening after 
evaluate "normally".
   
   `ctx` was useful previously because we weren't calling `arrow_eval()` on the 
expressions, we were walking them, looking for known aggregation functions, 
pulling them out and inserting into the expression the `..tempN` symbols, and 
then once it was all scalar functions left, calling `arrow_eval()`. But that 
doesn't allow for defining a `weighted.mean` binding as `function(x, w) sum(x * 
w) / sum(w)` because you can't just substitute into the call to 
`weighted.mean(x, w)`, you really just want to evaluate it and let the usual 
Arrow Expression logic work. 
   
   The challenge was: where is that "somewhere" to collect the aggregations? It 
needs to be somewhere where the binding functions can find it, we can't pass it 
in as an argument everywhere. 
   
   Generally, R looks up symbols in each parent frame of where the function is 
defined, like this:
   
   ```
   > x <- 1
   > f <- function() x
   > f()
   [1] 1
   ```
   
   So you'd think that you could just have `..aggregations` in this function, 
and when you evaluate the expressions, they would find them in the enclosing 
environment. But no. It's like this:
   
   ```
   > g <- function() {
   +   x <- 2
   +   f()
   + }
   > g()
   [1] 1
   ```
   
   `f()` has its own environment and looks up symbols from its parents. 
   
   So, I put `..aggregations` in this environment, and then set *this* 
environment as the parent for each of the aggregation bindings in 
`arrow_mask()`. That way, they can find it and assign into it. This seemed 
better than the alternatives I almost gave up and fell back to, like something 
global/in the package namespace.
   
   After doing that, I realized that I could do a similar thing with user 
functions: copy them into the mask and set their environment to be the mask, so 
they'd find the other bindings there.
   
   Does that make sense?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [R] Redo how summarize() evaluates expressions [arrow]

Reply via email to