r2evans commented on issue #45373:
URL: https://github.com/apache/arrow/issues/45373#issuecomment-3690224763
For the record: In the original example, we do lose the sort-column in the
summarization, but if we `mutate` instead of `summarize`, we should still have
the sort column so the sort is likely still appropriate (to the user, even if
it may not affect the calculations).
```r
arrow_table(mtcars) |>
arrange(mpg) |>
mutate(mpg_head = mean(mpg)) |>
collect()
# Error in `compute.arrow_dplyr_query()`:
# ! Invalid: Invalid sort key column: No match for FieldRef.Name(mpg) in
..temp0: double
# ----
# ..temp0:
# [
# [
# 20.090625
# ]
# ]
# Run `rlang::last_trace()` to see where the error occurred.
```
Moving the `arrange()` to _after_ `mutate()` works here, and it _may_ work
with the `summarize()` (if the code were slightly different ... unimportant
atm).
Are you suggesting that it is best to silently drop the sort? If we cannot
preserve `$arrange_*` during the collapse stage, some thoughts:
1. Drop the sort column(s) with a `warning()`. I think this is important,
since the user needs to know that the assumption of ordered data is no longer
true.
2. Is there a way to detect that this problem will occur and rearrange the
steps such that `arrange()` occurs after the collapse? This presents the
possibility that the column may not be available:
- If it is not visible post-collapse (e.g., `summarize()` that drops the
sort columns), then warn-and-drop;
- If it is visible post-collapse, then rearrange the calc-steps so that
the sorting is after the collapse-step(s).
BTW: I had used the `slice_head()` trick, but I ran into problems and had to
remove it, whether for performance reasons or some other fragility. I can't
find "why" I removed it from my code, so this is left dangling atm ...
Bottom line: while I understand that removing the sort in this case might be
the hack to stop the error, I think it needs to be done with a warning instead
of quietly. Thoughts?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]