This is an automated email from the ASF dual-hosted git repository.

npr pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/main by this push:
     new ab432b1362 GH-43627: [R] Fix summarize() performance regression (pushdown) (#43649)
ab432b1362 is described below

commit ab432b1362208696e60824b45a5599a4e91e6301
Author: Neal Richardson <[email protected]>
AuthorDate: Wed Aug 14 07:50:04 2024 -0700

    GH-43627: [R] Fix summarize() performance regression (pushdown) (#43649)
    
    ### Rationale for this change
    
    See https://github.com/apache/arrow/issues/43627#issuecomment-2284259559
    
    ### What changes are included in this PR?
    
    An extra `dplyr::select()`
    
    ### Are these changes tested?
    
    Conbench should show that the performance is much better
    
    ### Are there any user-facing changes?
    
    Not slow
    * GitHub Issue: #43627
---
 r/R/dplyr-summarize.R | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/r/R/dplyr-summarize.R b/r/R/dplyr-summarize.R
index f4fda0f13a..a9ad750de7 100644
--- a/r/R/dplyr-summarize.R
+++ b/r/R/dplyr-summarize.R
@@ -43,6 +43,15 @@ do_arrow_summarize <- function(.data, ..., .groups = NULL) {
     hash = length(.data$group_by_vars) > 0
   )
 
+  # Do a projection here to keep only the columns we need in summarize().
+  # If possible, this will push down the column selection into the SourceNode,
+  # saving lots of wasted processing for columns we don't need. (GH-43627)
+  vars_to_keep <- unique(c(
+    unlist(lapply(exprs, all.vars)), # vars referenced in summarize
+    dplyr::group_vars(.data) # vars needed for grouping
+  ))
+  .data <- dplyr::select(.data, intersect(vars_to_keep, names(.data)))
+
   # nolint start
   # summarize() is complicated because you can do a mixture of scalar operations
   # and aggregations, but that's not how Acero works. For example, for us to do
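
The column-pruning step added in this diff can be tried in isolation. The sketch below uses only base R; `exprs`, `group_vars`, and `col_names` are hypothetical stand-ins for the objects available inside `do_arrow_summarize()`, not the actual arrow internals. It shows how `all.vars()` collects the symbols referenced by the summarize expressions, and how `intersect()` drops names that are not columns (presumably the reason for the `intersect()` call in the patch, for example when an expression refers to a variable in the calling environment).

```r
# Hypothetical stand-ins: unevaluated summarize() expressions, the
# grouping columns, and the column names of the input data.
exprs <- list(
  total = quote(sum(x * y)),
  big   = quote(sum(x > threshold)) # `threshold` is not a column
)
group_vars <- c("g")
col_names <- c("g", "x", "y", "z", "unused")

# all.vars() returns the variable names used in an expression;
# function names such as `sum` are skipped by default.
vars_to_keep <- unique(c(
  unlist(lapply(exprs, all.vars)), # vars referenced in the expressions
  group_vars                       # vars needed for grouping
))

# Keep only names that are actual columns, so that a later
# selection never asks for something that does not exist.
cols_to_select <- intersect(vars_to_keep, col_names)
print(cols_to_select)
```

Here `z` and `unused` are never referenced, so they fall out of the selection, and `threshold` is dropped because it is not a column.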
