[
https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631279#comment-17631279
]
Lance Dacey commented on ARROW-15474:
-------------------------------------
Nice, I was able to test it out, and it seems to give the correct results. I
have been using polars and duckdb to handle de-duplication for a while now, so
I used duckdb as a comparison.
{code:python}
%%time
table = con.execute("select distinct on (forecast_group) * from scanner order by session_id, date").arrow()
CPU times: user 735 ms, sys: 45.7 ms, total: 780 ms
Wall time: 1.92 s
{code}
Your suggestion:
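{code:python}
%%time
# Assumes numpy as np, pyarrow as pa, and pyarrow.compute as pc are already imported.
table = scanner.to_table()
# Tag every row with its position in the table.
t1 = table.append_column('i', pa.array(np.arange(len(table))))
# For each forecast_group, keep the position of its first row.
t2 = t1.group_by(['forecast_group']).aggregate([('i', 'min')]).column('i_min')
# Select only those rows from the original table.
table = pc.take(table, t2)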
CPU times: user 872 ms, sys: 60.9 ms, total: 933 ms
Wall time: 4.6 s
{code}
This is a bit slower than duckdb (4.6 s vs. 1.92 s wall time), but that is
acceptable for me, and it gives me a way to drop duplicates without additional
libraries such as pandas. Thanks!
> [Python] Possibility of a table.drop_duplicates() function?
> -----------------------------------------------------------
>
> Key: ARROW-15474
> URL: https://issues.apache.org/jira/browse/ARROW-15474
> Project: Apache Arrow
> Issue Type: Wish
> Components: Python
> Affects Versions: 6.0.1
> Reporter: Lance Dacey
> Priority: Major
>
> I noticed that there is a group_by() and sort_by() function in the 7.0.0
> branch. Is it possible to include a drop_duplicates() function as well?
> ||id||updated_at||
> |1|2022-01-01 04:23:57|
> |2|2022-01-01 07:19:21|
> |2|2022-01-10 22:14:01|
> Something like this, which would return a table without the second row in
> the example above, would be great.
> I am usually reading an append-only dataset and then need to report on the
> latest version of each row. To drop duplicates, I temporarily convert the
> append-only table to a pandas DataFrame, drop the duplicates there, and then
> convert it back to a table and save a separate "latest-version" dataset.
> {code:python}
> table.sort_by(
>     sorting=[("id", "ascending"), ("updated_at", "ascending")]
> ).drop_duplicates(subset=["id"], keep="last")
> {code}