nealrichardson commented on issue #45373: URL: https://github.com/apache/arrow/issues/45373#issuecomment-3693313985
I haven't looked at this code in a while, but the bug is in here: https://github.com/apache/arrow/blob/main/r/R/query-engine.R#L100-L178

Aggregations are treated differently from other steps, like joins, and there are apparently different assumptions about when and where `collapse()` has happened. I would guess we got to this point because, at least in the beginning, there were no aggregation functions that depended on data order (everything was fundamentally unordered), so there was no reason to `arrange()` before `summarize()`. I'm not sure if that's still the case or not.

Inserting `collapse()` before `summarize()` is the workaround: all it does is force those query steps to get constructed before going on to the next ones.

I'm not sure what the best fix in the package is. The options I see are:

- reworking the logic in the lines I linked;
- inserting `collapse()` in the summarize function before the aggregations happen (that would probably work, but maybe it would just avoid the bug by convention rather than being a real bug fix);
- just dropping the sort before summarize like the LLM suggested, but only if there aren't any transformations that rely on order. That just sounds like a bug waiting to emerge in the future (a larva then, if not a fully grown bug yet?).
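For reference, here is a minimal sketch of the workaround, assuming the `arrow` and `dplyr` packages are installed. The table, its column names, and the choice of `sum()` are made up for illustration and are not taken from the issue report; the only point is where `collapse()` sits in the pipeline.

```r
library(arrow)
library(dplyr)

# Hypothetical example data, not from the original issue.
tbl <- arrow_table(
  g = c("a", "a", "b"),
  x = c(3, 1, 2)
)

result <- tbl %>%
  arrange(x) %>%
  # Workaround: force the query steps so far (here, the sort) to be
  # constructed before the aggregation step is added.
  collapse() %>%
  group_by(g) %>%
  summarize(total = sum(x)) %>%
  collect()

print(result)
```

`sum()` does not depend on row order, so this only shows the shape of the pipeline; whether the sort matters to the result depends on the aggregation being used.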
