This is an automated email from the ASF dual-hosted git repository.
npr pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/main by this push:
new ab432b1362 GH-43627: [R] Fix summarize() performance regression (pushdown) (#43649)
ab432b1362 is described below
commit ab432b1362208696e60824b45a5599a4e91e6301
Author: Neal Richardson <[email protected]>
AuthorDate: Wed Aug 14 07:50:04 2024 -0700
GH-43627: [R] Fix summarize() performance regression (pushdown) (#43649)
### Rationale for this change
See https://github.com/apache/arrow/issues/43627#issuecomment-2284259559
### What changes are included in this PR?
An extra `dplyr::select()`
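
As a rough illustration of how the added projection picks its columns, the sketch below reproduces the `vars_to_keep` computation from the diff in base R only. The expressions, grouping variable, and column names are hypothetical stand-ins; in the package, `exprs` comes from the captured `summarize()` arguments and the grouping variables from `dplyr::group_vars(.data)`.

```r
# Hypothetical summarize() expressions, standing in for `exprs` in the diff.
exprs <- list(
  total = quote(sum(price * qty)),
  avg = quote(mean(price))
)

# Hypothetical grouping variables and dataset columns (assumptions for this sketch).
group_vars <- c("region")
col_names <- c("region", "price", "qty", "notes", "timestamp")

# Collect every variable referenced by the expressions, plus the grouping vars.
vars_to_keep <- unique(c(
  unlist(lapply(exprs, all.vars)),
  group_vars
))

# Keep only names that are real columns; "notes" and "timestamp" are dropped,
# so a scan that supports projection pushdown would not need to read them.
cols <- intersect(vars_to_keep, col_names)
print(cols)
```

The `intersect()` step matters because `all.vars()` also returns symbols that are not columns (for example, variables from the calling environment), and selecting those would fail.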
### Are these changes tested?
Conbench should show that the performance is much better
### Are there any user-facing changes?
Not slow
* GitHub Issue: #43627
---
r/R/dplyr-summarize.R | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/r/R/dplyr-summarize.R b/r/R/dplyr-summarize.R
index f4fda0f13a..a9ad750de7 100644
--- a/r/R/dplyr-summarize.R
+++ b/r/R/dplyr-summarize.R
@@ -43,6 +43,15 @@ do_arrow_summarize <- function(.data, ..., .groups = NULL) {
hash = length(.data$group_by_vars) > 0
)
+ # Do a projection here to keep only the columns we need in summarize().
+ # If possible, this will push down the column selection into the SourceNode,
+ # saving lots of wasted processing for columns we don't need. (GH-43627)
+ vars_to_keep <- unique(c(
+ unlist(lapply(exprs, all.vars)), # vars referenced in summarize
+ dplyr::group_vars(.data) # vars needed for grouping
+ ))
+ .data <- dplyr::select(.data, intersect(vars_to_keep, names(.data)))
+
# nolint start
# summarize() is complicated because you can do a mixture of scalar operations
# and aggregations, but that's not how Acero works. For example, for us to do