[GitHub] [arrow] jorisvandenbossche commented on issue #36709: Guarantee that `group_by` has stable ordering.

via GitHub Wed, 19 Jul 2023 01:27:18 -0700


jorisvandenbossche commented on issue #36709:
URL: https://github.com/apache/arrow/issues/36709#issuecomment-1641651268


   Actually, looking at what changed in the groupby implementation over the 
last few months, I suspect my clean-up PR 
https://github.com/apache/arrow/pull/34769 caused this. Before that, pyarrow's 
`group_by` used an `arrow::compute::TableGroupBy` helper under the hood, and 
that small helper was removed in favour of directly using the Acero declaration 
it was wrapping.  
   
   That should have been fully equivalent, but now I see that the 
`TableGroupBy` helper had a default of `bool use_threads = false` in its 
header file, while the new Python code calls 
`decl.to_table(use_threads=True)`. 
   
   That probably explains the difference in behaviour: in 12.0, the 
`group_by` method was not yet running in parallel, while now it is.
   
   The question is still whether we are fine with this change. We actually do 
have some (hash) aggregations that depend on the input being ordered (e.g. 
first/last), but I don't think there is a way to "force" an ordered calculation 
for other aggregations (like `hash_list`), other than disabling parallel 
execution. 
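
   To illustrate why a list-style aggregation is sensitive to this, here is a 
toy model in plain Python. It is not Acero's actual implementation: the 
helpers `partial_lists` and `merge` are made up for this sketch, and the 
"threads" are simulated by merging per-batch results in different orders.

```python
from collections import defaultdict

def partial_lists(rows):
    """Group one batch of (key, value) rows, keeping values in arrival order."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return groups

def merge(partials):
    """Concatenate per-batch results in the order the batches are merged."""
    merged = defaultdict(list)
    for part in partials:
        for key, values in part.items():
            merged[key].extend(values)
    return dict(merged)

rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
batch_1, batch_2 = rows[:2], rows[2:]

# Single pass over all rows: values stay in input order.
sequential = merge([partial_lists(rows)])
# Two batches merged in input order: same result as the single pass.
in_order = merge([partial_lists(batch_1), partial_lists(batch_2)])
# Second batch merged first (as if its worker finished earlier): order flips.
out_of_order = merge([partial_lists(batch_2), partial_lists(batch_1)])
```

   The groups themselves are identical in all three cases; only the order of 
the values inside each list depends on which batch is merged first.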
   
   We should probably at least expose `use_threads` in `group_by`, so you can 
still set that to False to keep the old behaviour.
   
   


