jorisvandenbossche commented on issue #36709: URL: https://github.com/apache/arrow/issues/36709#issuecomment-1641651268
Actually, looking at what changed in the groupby implementation over the last few months, I suspect my clean-up PR https://github.com/apache/arrow/pull/34769 caused this. Before that PR, pyarrow's `group_by` used an `arrow::compute::TableGroupBy` helper under the hood; the PR removed that small helper and now directly uses the Acero declaration the helper was wrapping. That should have been fully equivalent, but I now see that the `TableGroupBy` helper defaulted to `bool use_threads = false` in its header file, while the new Python code calls `decl.to_table(use_threads=True)`. That probably explains the difference in behaviour: in 12.0, the `group_by` method was not yet running in parallel, while now it is.

The question is still whether we are fine with this change. We actually _do_ have some (hash) aggregations that _do_ depend on the input being ordered (e.g. first/last), but I don't think there is a way to "force" an ordered calculation for other aggregations (like `hash_list`), except by not running in parallel.

We should probably at least expose `use_threads` in `group_by`, so you can still set it to `False` to keep the old behaviour.
