westonpace commented on code in PR #14482:
URL: https://github.com/apache/arrow/pull/14482#discussion_r1003583810
##########
python/pyarrow/table.pxi:
##########
@@ -5282,6 +5282,7 @@ class TableGroupBy:
list[tuple(str, str, FunctionOptions)]
List of tuples made of aggregation column names followed
by function names and optionally aggregation function options.
+ Pass empty list to imitate drop_duplicates pandas function.
Review Comment:
It's not quite the same though. Pandas `drop_duplicates` will keep columns
that are not key columns. By default it will keep the first value in each
group, though this is configurable. For example:
```
>>> tab = pa.Table.from_pydict({"x": [1, 1, 1, 2, 2], "y": ["a", "b", "c", "d", "e"]})
>>> pa.TableGroupBy(tab, "x").aggregate([])
pyarrow.Table
x: int64
----
x: [[1,2]]
```
With `drop_duplicates` you would also get `y: [["a", "d"]]`. You can roughly
imitate this with the `one` function, which picks an arbitrary value from a
non-key column for each group ("first" and "last" are difficult concepts
within datasets at the moment).
```
>>> pa.TableGroupBy(tab, "x").aggregate([("y", "one")])
pyarrow.Table
y_one: string
x: int64
----
y_one: [["a","d"]]
x: [[1,2]]
```
Either way, maybe this should be:
```suggestion
Pass empty list to get a single row for each group.
```