thisisnic commented on issue #45373: URL: https://github.com/apache/arrow/issues/45373#issuecomment-3689362571
@r2evans - here's the summary it came up with. Looks reasonable to me - would you be interested in submitting a PR?

________________________________________

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## The Problem

When you call `arrange() |> summarize()`, the query stores the sort columns in `arrange_vars`. When the query is executed in `ExecPlan$Build()`, the aggregation happens first (producing only the aggregated columns like `min_mpg`), but then the code tries to apply sorting using the original `arrange_vars`, which reference columns that no longer exist (like `mpg`).

The relevant code flow:

1. `arrange(mpg)` stores `arrange_vars = {mpg}` on the query
2. `summarize()` calls `collapse()`, which nests the query - the inner query still has `arrange_vars`
3. In `ExecPlan$Build()` (query-engine.R), the aggregation is applied first (lines 100-129), then sorting is attempted (lines 164-178)
4. The sort fails because `mpg` no longer exists in the schema after aggregation

## Why the `slice_head()` workaround works

`slice_head()` triggers its own `collapse()` with `head` set. When that inner query is built, the sorting IS applied (because `head` needs sorted data), and THEN the head is taken. The resulting outer query has empty `arrange_vars`, so when `summarize()` runs, there's nothing to sort.

## The Fix

In `do_arrow_summarize()` (dplyr-summarize.R), clear `arrange_vars` before calling `collapse()`:

```r
# Clear arrange vars - sorting before aggregation is meaningless for Arrow's
# native aggregations, and the columns may not exist after aggregation.
# Users who want sorted output should arrange() after summarize(). (GH-45373)
.data$arrange_vars <- list()
.data$arrange_desc <- logical()
out <- collapse.arrow_dplyr_query(.data)
```

This is safe because:

- For ungrouped aggregation, sorting beforehand is semantically meaningless
- Arrow doesn't have native order-sensitive aggregations (e.g., `first()` falls back to R)
- If users want sorted output, they should `arrange()` after `summarize()`
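
To make the recommendation to `arrange()` after `summarize()` concrete, here is a minimal sketch of both patterns. It assumes the `arrow` and `dplyr` packages are installed and R >= 4.1 for the native pipe; the `mtcars` data and the `cyl` grouping are illustrative choices, not taken from the issue. The first pipeline is wrapped in `tryCatch()` because whether it errors depends on the installed arrow version.

```r
library(arrow)
library(dplyr)

# Illustrative data only: mtcars ships with base R.
tbl <- arrow_table(mtcars)

# Pattern described in the issue: arrange() before summarize().
# On affected versions this may raise an error, so capture it instead of stopping.
before <- tryCatch(
  tbl |> arrange(mpg) |> summarize(min_mpg = min(mpg)) |> collect(),
  error = function(e) conditionMessage(e)
)

# Suggested pattern: aggregate first, then sort on a column that
# exists in the aggregated output.
after <- tbl |>
  group_by(cyl) |>
  summarize(min_mpg = min(mpg)) |>
  arrange(min_mpg) |>
  collect()

print(after)
```

Sorting on `min_mpg` works here because that column is part of the schema after aggregation, unlike `mpg`.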
