[jira] [Comment Edited] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?

Weston Pace (Jira) Mon, 14 Nov 2022 09:32:04 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17633943#comment-17633943
 ]


Weston Pace edited comment on ARROW-15474 at 11/14/22 5:31 PM:
---------------------------------------------------------------

{quote}
Maybe even ordering function can be specified so there would be no need to sort 
the array a priori.
{quote}

You can run your ordering function first so that your ordering is a column in 
the data.  Then, for the first/last kernel the ordering would just be a field 
ref.  The rest would be a straightforward aggregate kernel I think.  The 
"state" would be the current first/last and the value of the ordering field at 
that point.

So yes, I think that approach should work to avoid sorting beforehand.
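
To make the idea concrete, here is a minimal pure-Python sketch of that aggregation state. It is not an Arrow kernel and uses no Arrow API; the function name last_by and the list-of-dicts row layout are assumptions made up for illustration. For each key it keeps only the current "last" row and the ordering value that row carried, so the input never needs to be sorted.

```python
from typing import Any, Dict, Hashable, Iterable, List, Tuple

Row = Dict[str, Any]


def last_by(rows: Iterable[Row], key: str, order_by: str) -> List[Row]:
    """Hypothetical helper: keep, per key, the row with the largest ordering value."""
    # state maps each key to (largest ordering value seen so far, row that carried it)
    state: Dict[Hashable, Tuple[Any, Row]] = {}
    for row in rows:
        current = state.get(row[key])
        # ">=" lets a later row win ties, which mirrors keep="last"
        if current is None or row[order_by] >= current[0]:
            state[row[key]] = (row[order_by], row)
    return [row for _, row in state.values()]


# Rows from the issue description, deliberately left unsorted.
rows = [
    {"id": 2, "updated_at": "2022-01-10 22:14:01"},
    {"id": 1, "updated_at": "2022-01-01 04:23:57"},
    {"id": 2, "updated_at": "2022-01-01 07:19:21"},
]
latest = last_by(rows, key="id", order_by="updated_at")
```

A "first" variant would flip the comparison to "<". Memory use grows with the number of distinct keys rather than the number of rows, which is the property that makes this usable as a streaming aggregate.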


was (Author: westonpace):
{quote}
Maybe even ordering function can be specified so there would be no need to sort 
the array a priori.
{quote}

You can run your ordering function first so that your ordering is a column in 
the database.  Then, for the first/last kernel the ordering would just be a 
field ref.  The rest would be a straightforward aggregate kernel I think.  The 
"state" would be the current first/last and the value of the ordering field at 
that point.

So yes, I think that approach should work to avoid sorting beforehand.

> [Python] Possibility of a table.drop_duplicates() function?
> -----------------------------------------------------------
>
>                 Key: ARROW-15474
>                 URL: https://issues.apache.org/jira/browse/ARROW-15474
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python
>    Affects Versions: 6.0.1
>            Reporter: Lance Dacey
>            Priority: Major
>
> I noticed that there are group_by() and sort_by() functions in the 7.0.0 
> branch. Is it possible to include a drop_duplicates() function as well? 
> ||id||updated_at||
> |1|2022-01-01 04:23:57|
> |2|2022-01-01 07:19:21|
> |2|2022-01-10 22:14:01|
> Something like this which would return a table without the second row in the 
> example above would be great. 
> I am usually reading an append-only dataset and then need to report on the 
> latest version of each row. To drop duplicates, I temporarily convert the 
> append-only table to a pandas DataFrame, drop the duplicates there, and then 
> convert it back to a table and save a separate "latest-version" dataset.
> {code:python}
> table.sort_by(sorting=[("id", "ascending"), ("updated_at", 
> "ascending")]).drop_duplicates(subset=["id"], keep="last")
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
