westonpace commented on issue #14882:
URL: https://github.com/apache/arrow/issues/14882#issuecomment-1347260487
Aggregate methods, in particular the `list` aggregation, can get you there, but I think it is a bit clumsy:
```
import pyarrow as pa

tab = pa.Table.from_pydict({
    "key": [1, 1, 1, 2, 2],
    "value1": [1, 2, 3, 4, 5],
    "value2": ["a", "b", "c", "d", "e"],
})

# Collapse each group into list columns (one output row per distinct key).
lists = tab.group_by("key").aggregate([("value1", "list"), ("value2", "list")])
keys = lists.column("key")
for i in range(lists.num_rows):
    print(f"Table for group: {keys[i]}")
    # Unpack this group's list values back into flat arrays.
    val1_col = lists.column("value1_list")[i].values
    val2_col = lists.column("value2_list")[i].values
    sub_table = pa.Table.from_arrays([val1_col, val2_col], names=["value1", "value2"])
    print(sub_table)
    print()
```
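With the example data this prints one sub-table per distinct key: a three-row table for the rows where `key == 1` and a two-row table for the rows where `key == 2`.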
I would support a new API for this. However, it will probably need changes at the C++ level. I think an intuitive implementation would lead to a single exec plan with multiple sinks (or we could just canonicalize the above usage of the `list` aggregate function and provide a convenience wrapper for it).

This would be very useful, for example, for someone who wants to partition a table based on some column (or the hash of a column) and distribute the work.
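As a rough illustration of what that convenience could look like, here is a minimal sketch built on the list-aggregate pattern above. The `split_by_group` helper and the `"partition"` column are hypothetical names for this example, not existing pyarrow APIs:

```
import pyarrow as pa

def split_by_group(table, key):
    """Hypothetical helper: yield (key value, sub-table) pairs for each group.

    Built on the `list` aggregation shown above; not an existing pyarrow API.
    """
    value_names = [name for name in table.column_names if name != key]
    grouped = table.group_by(key).aggregate([(name, "list") for name in value_names])
    keys = grouped.column(key)
    for i in range(grouped.num_rows):
        # Aggregate output columns are named "<column>_list".
        arrays = [grouped.column(f"{name}_list")[i].values for name in value_names]
        yield keys[i], pa.Table.from_arrays(arrays, names=value_names)

# Example use case from above: partition by the hash of a column and
# distribute the pieces to workers.
tab = pa.Table.from_pydict({
    "key": [1, 1, 1, 2, 2],
    "value1": [1, 2, 3, 4, 5],
    "value2": ["a", "b", "c", "d", "e"],
})
num_partitions = 2
partition_ids = pa.array([hash(k) % num_partitions for k in tab.column("key").to_pylist()])
partitioned = tab.append_column("partition", partition_ids)
for part_id, piece in split_by_group(partitioned, "partition"):
    print(f"partition {part_id}: {piece.num_rows} rows")  # e.g. hand `piece` to a worker
```

The sketch does everything in Python on top of the existing aggregation, so a native C++ implementation (a single exec plan with multiple sinks) could avoid materializing the intermediate list columns.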