nealrichardson commented on issue #45373:
URL: https://github.com/apache/arrow/issues/45373#issuecomment-3693313985

   I haven't looked at this code in a while, but the bug is in here: 
https://github.com/apache/arrow/blob/main/r/R/query-engine.R#L100-L178
   
   Aggregations are treated differently from other steps, like joins, and 
there are apparently different assumptions about when and where `collapse()` has 
happened. I would guess we got to this point because, at least in the 
beginning, there were no aggregation functions that depended on data 
order--everything was fundamentally unordered--so there was no reason to 
`arrange()` before `summarize()`. I'm not sure if that's still the case or not.
   
   Inserting `collapse()` before `summarize()` is the workaround--all it does 
is force those query steps to get constructed before going to the next ones. 
I'm not sure what the best fix in the package is. One option is reworking the 
logic in the lines I linked. Another is inserting `collapse()` in the summarize 
function before the aggregations happen (that would probably work, but it might 
just avoid the bug by convention rather than really fix it). A third is to just 
drop sorting before summarize like the LLM suggested, but only if there aren't 
any transformations that rely on order--that just sounds like a bug waiting to 
emerge in the future (a larva then, if not a fully grown bug yet?).
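
   A minimal sketch of the `collapse()` workaround, assuming the arrow and 
dplyr packages are installed. The table, the column names (`grp`, `val`), and 
the `sum()` aggregation are made up for illustration and are not the query 
from the issue; the point is only where `collapse()` sits in the pipeline.

```r
library(arrow)
library(dplyr)

# Hypothetical data: two groups with unsorted values.
tbl <- arrow_table(
  grp = c("a", "a", "b", "b"),
  val = c(3, 1, 4, 2)
)

result <- tbl %>%
  arrange(val) %>%
  # Workaround: collapse() wraps the steps so far (here, the sort) into a
  # nested query, so they are constructed before the aggregation is added.
  collapse() %>%
  group_by(grp) %>%
  # sum() is only a stand-in; it does not depend on row order itself.
  summarize(total = sum(val)) %>%
  collect()
```

   The order of the groups in `result` is not guaranteed, so add an explicit 
`arrange()` after `summarize()` if the output order matters.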


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
