nealrichardson opened a new pull request, #41350:
URL: https://github.com/apache/arrow/pull/41350
### Rationale for this change
Since it doesn't look like Acero will be getting window functions any time
soon, implement support in `mutate()` for transformations that involve
aggregations, like `x - mean(x)`, via a `left_join()`.
### What changes are included in this PR?
Following #41223, I realized I could reuse that evaluation path in
`mutate()`. Evaluating expressions accumulates `..aggregations` and
`mutate_stuff`; in `summarize()` we apply the aggregations and then mutate on
the result. If expressions in `mutate_stuff` reference columns in the original
data and not just the results of aggregations, we reject them.
In `mutate()`, if there are aggregations, we apply them on a copy of the query
up to that point, join the result back onto the query, and then apply the
mutations on that. It's not a problem for those mutate expressions to reference
both columns in the original data and the results of the aggregations, because
both are present after the join.
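The aggregate-then-join idea can be sketched in plain dplyr on an in-memory data frame. This is only an illustration of the approach described above, not the PR's internal code; the temporary column names `agg_mean` and `agg_sd` are hypothetical.

``` r
library(dplyr)

df <- mtcars |> select(cyl, mpg)

# Step 1: run the aggregations on a copy of the query so far.
# `agg_mean` and `agg_sd` are hypothetical temporary column names.
aggs <- df |>
  group_by(cyl) |>
  summarize(agg_mean = mean(mpg), agg_sd = sd(mpg))

# Step 2: join the aggregates back onto the original rows.
# Step 3: evaluate the mutate expression, which can now see both the
# original columns and the aggregation results, then drop the temporaries.
via_join <- df |>
  left_join(aggs, by = "cyl") |>
  mutate(stdize_mpg = (mpg - agg_mean) / agg_sd) |>
  select(-agg_mean, -agg_sd)

# Reference: the grouped mutate that dplyr evaluates natively.
direct <- df |>
  group_by(cyl) |>
  mutate(stdize_mpg = (mpg - mean(mpg)) / sd(mpg)) |>
  ungroup()

# Compare the two results.
all.equal(via_join$stdize_mpg, direct$stdize_mpg)
```

On a data frame, dplyr's `left_join()` keeps the row order of the left-hand table, so the comparison lines up row by row; as noted in the caveats, that ordering is not guaranteed on the Acero code path.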
There are two caveats:
* The join has non-deterministic row order, so while `mutate()` doesn't
generally affect row order, row order may not be stable if this code path is
activated.
* Acero's join seems to have a limitation currently where missing values are
not joined to each other. If your join key has NA in it and you do a
left_join, the new columns will all be NA for the rows with an NA key, even if
there is a corresponding value in the right dataset. I'll make an issue for
that, and if it's not readily fixable in Acero, I can think of some workarounds.
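Both caveats can be imitated with plain dplyr on a data frame. The sketch below is an assumption-laden illustration, not Acero's behavior or API: `na_matches` is an argument of dplyr's data-frame joins, used here only to mimic NA keys failing to match, and the `row_id` column is a hypothetical workaround for restoring row order that this PR does not implement.

``` r
library(dplyr)

# Toy data with a missing value in the grouping/join key.
df <- data.frame(g = c("a", "a", NA, NA), x = c(1, 3, 10, 20))

aggs <- df |>
  group_by(g) |>
  summarize(mean_x = mean(x))

# Hypothetical workaround for unstable order: record an explicit row index
# before the join and sort by it afterwards.
indexed <- df |> mutate(row_id = row_number())

# dplyr's default for data frames: NA keys match each other.
na_matched <- indexed |>
  left_join(aggs, by = "g", na_matches = "na") |>
  arrange(row_id)

# NA keys never match, which resembles the limitation described above:
# rows with an NA key get NA in every joined column.
na_unmatched <- indexed |>
  left_join(aggs, by = "g", na_matches = "never") |>
  arrange(row_id)

na_matched$mean_x
na_unmatched$mean_x
```

In the second result, the rows with an NA key have `mean_x` equal to NA even though `aggs` contains a row for the NA group.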
### Are these changes tested?
Yes
### Are there any user-facing changes?
This works now:
``` r
library(arrow)
library(dplyr)
mtcars |>
  arrow_table() |>
  select(cyl, mpg, hp) |>
  group_by(cyl) |>
  mutate(stdize_mpg = (mpg - mean(mpg)) / sd(mpg)) |>
  collect()
#> # A tibble: 32 × 4
#> # Groups:   cyl [3]
#>      cyl   mpg    hp stdize_mpg
#>    <dbl> <dbl> <dbl>      <dbl>
#>  1     6  21     110      0.865
#>  2     6  21     110      0.865
#>  3     4  22.8    93     -0.857
#>  4     6  21.4   110      1.14
#>  5     8  18.7   175      1.41
#>  6     6  18.1   105     -1.13
#>  7     8  14.3   245     -0.312
#>  8     4  24.4    62     -0.502
#>  9     4  22.8    95     -0.857
#> 10     6  19.2   123     -0.373
#> # ℹ 22 more rows
```
<sup>Created on 2024-04-23 with [reprex
v2.1.0](https://reprex.tidyverse.org)</sup>