westonpace commented on issue #14882:
URL: https://github.com/apache/arrow/issues/14882#issuecomment-1347260487
Aggregate methods, in particular the `list` aggregation, can get you there, but I think it is a bit clumsy:
```
import pyarrow as pa

tab = pa.Table.from_pydict({
    "key": [1, 1, 1, 2, 2],
    "value1": [1, 2, 3, 4, 5],
    "value2": ["a", "b", "c", "d", "e"],
})

# Collapse each group into list columns (one output row per distinct key).
lists = tab.group_by("key").aggregate([("value1", "list"), ("value2", "list")])
keys = lists.column("key")
for i in range(lists.num_rows):
    print(f"Table for group: {keys[i]}")
    # Unpack this group's list values back into flat arrays.
    val1_col = lists.column("value1_list")[i].values
    val2_col = lists.column("value2_list")[i].values
    sub_table = pa.Table.from_arrays([val1_col, val2_col], names=["value1", "value2"])
    print(sub_table)
    print()
```
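With the example data this prints one sub-table per distinct key: a three-row table for the rows where `key == 1` and a two-row table for the rows where `key == 2`.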
I would support a new API for this. However, it will probably need changes at the C++ level. I think an intuitive implementation would lead to a single exec plan with multiple sinks (or we could just canonicalize the above usage of the `list` aggregate function and provide a convenience wrapper for it).

This would be very useful, for example, for someone who wants to partition a table based on some column (or the hash of a column) and distribute the work.
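As a rough illustration of what that convenience could look like, here is a minimal sketch built on the list-aggregate pattern above. The `split_by_group` helper and the `"partition"` column are hypothetical names for this example, not existing pyarrow APIs:

```
import pyarrow as pa

def split_by_group(table, key):
    """Hypothetical helper: yield (key value, sub-table) pairs for each group.

    Built on the `list` aggregation shown above; not an existing pyarrow API.
    """
    value_names = [name for name in table.column_names if name != key]
    grouped = table.group_by(key).aggregate([(name, "list") for name in value_names])
    keys = grouped.column(key)
    for i in range(grouped.num_rows):
        # Aggregate output columns are named "<column>_list".
        arrays = [grouped.column(f"{name}_list")[i].values for name in value_names]
        yield keys[i], pa.Table.from_arrays(arrays, names=value_names)

# Example use case from above: partition by the hash of a column and
# distribute the pieces to workers.
tab = pa.Table.from_pydict({
    "key": [1, 1, 1, 2, 2],
    "value1": [1, 2, 3, 4, 5],
    "value2": ["a", "b", "c", "d", "e"],
})
num_partitions = 2
partition_ids = pa.array([hash(k) % num_partitions for k in tab.column("key").to_pylist()])
partitioned = tab.append_column("partition", partition_ids)
for part_id, piece in split_by_group(partitioned, "partition"):
    print(f"partition {part_id}: {piece.num_rows} rows")  # e.g. hand `piece` to a worker
```

The sketch does everything in Python on top of the existing aggregation, so a native C++ implementation (a single exec plan with multiple sinks) could avoid materializing the intermediate list columns.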