thisisnic commented on issue #45373:
URL: https://github.com/apache/arrow/issues/45373#issuecomment-3689362571

   @r2evans - here's the summary it came up with. Looks reasonable to me - 
would you be interested in submitting a PR?
   
   ________________________________________
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)
   
   ## The Problem
   
   When you call `arrange() |> summarize()`, the query stores the sort columns 
in `arrange_vars`. When the query is executed in `ExecPlan$Build()`, the 
aggregation runs first and produces only the aggregated columns (such as 
`min_mpg`). The code then tries to sort using the original `arrange_vars`, 
which reference columns that no longer exist (such as `mpg`).
   
   The relevant code flow:
   1. `arrange(mpg)` stores `arrange_vars = {mpg}` on the query
   2. `summarize()` calls `collapse()`, which nests the query; the inner query 
still has `arrange_vars`
   3. In `ExecPlan$Build()` (query-engine.R), the aggregation is applied first 
(lines 100-129), then sorting is attempted (lines 164-178)
   4. The sort fails because `mpg` no longer exists in the schema after 
aggregation
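
   The ordering problem in the steps above can be sketched with a toy model in base R. This is not Arrow's implementation: `build()` and the list-based `query` are hypothetical stand-ins, and `mtcars` is assumed only because the column names `mpg` and `min_mpg` suggest it.

   ```r
   # Toy model in base R (NOT Arrow's actual code): a "query" is a list that
   # holds the data plus the pending sort columns.
   query <- list(data = mtcars, arrange_vars = "mpg")

   # Hypothetical stand-in for the build step: aggregate first, then sort.
   build <- function(query) {
     # After aggregation only `min_mpg` survives in the schema.
     out <- data.frame(min_mpg = min(query$data$mpg))

     # The sort step still refers to the pre-aggregation columns.
     missing <- setdiff(query$arrange_vars, names(out))
     if (length(missing) > 0) {
       stop("sort column(s) not in schema after aggregation: ",
            paste(missing, collapse = ", "))
     }
     for (v in rev(query$arrange_vars)) {
       out <- out[order(out[[v]]), , drop = FALSE]
     }
     out
   }

   # With arrange_vars still set, the sort step fails.
   failed <- inherits(try(build(query), silent = TRUE), "try-error")

   # Clearing arrange_vars before building (the idea behind the proposed fix)
   # lets the same build succeed.
   query$arrange_vars <- character()
   fixed <- build(query)
   ```

   In this sketch the failure comes purely from the order of operations: the sort is valid against the input schema but not against the aggregated one.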
   
   ## Why the `slice_head()` workaround works
   
   `slice_head()` triggers its own `collapse()` with `head` set. When that 
inner query is built, the sorting IS applied (because `head` needs sorted 
data), and THEN the head is taken. The resulting outer query has empty 
`arrange_vars`, so when `summarize()` runs, there's nothing to sort.
   
   ## The Fix
   
   In `do_arrow_summarize()` (dplyr-summarize.R), clear `arrange_vars` before 
calling `collapse()`:
   
   ```r
   # Clear arrange vars - sorting before aggregation is meaningless for Arrow's
   # native aggregations, and the columns may not exist after aggregation.
   # Users who want sorted output should arrange() after summarize(). (GH-45373)
   .data$arrange_vars <- list()
   .data$arrange_desc <- logical()
   
   out <- collapse.arrow_dplyr_query(.data)
   ```
   
   This is safe because:
   - For ungrouped aggregation, sorting beforehand is semantically meaningless
   - Arrow doesn't have native order-sensitive aggregations (e.g., `first()` 
falls back to R)
   - If users want sorted output, they should `arrange()` after `summarize()`
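
   For the last point, a minimal sketch of the recommended ordering, with `arrange()` placed after `summarize()` so that it sorts on a column that exists in the aggregated result. The `mtcars` data and the `cyl` grouping are assumptions chosen for illustration, not taken from the issue.

   ```r
   library(arrow)
   library(dplyr)

   # Aggregate first, then sort on a column of the aggregated output.
   # `mtcars` and the `cyl` grouping are illustrative assumptions.
   result <- arrow_table(mtcars) |>
     group_by(cyl) |>
     summarize(min_mpg = min(mpg)) |>
     arrange(min_mpg) |>
     collect()
   ```

   Here the sort refers to `min_mpg`, which is part of the schema after aggregation, so it does not depend on how `arrange_vars` from before the `summarize()` are handled.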

