Lance Dacey created ARROW-15474:
-----------------------------------

             Summary: [Python] Possibility of a table.drop_duplicates() function?
                 Key: ARROW-15474
                 URL: https://issues.apache.org/jira/browse/ARROW-15474
             Project: Apache Arrow
          Issue Type: Wish
    Affects Versions: 6.0.1
            Reporter: Lance Dacey
             Fix For: 8.0.0


I noticed that the 7.0.0 branch has group_by() and sort_by() functions. 
Would it be possible to include a drop_duplicates() function as well? 

||id||updated_at||
|1|2022-01-01 04:23:57|
|2|2022-01-01 07:19:21|
|2|2022-01-10 22:14:01|

A function that returns this table without the second row (the older of the 
two records for id 2) would be great. 

I usually read an append-only dataset and then need to report on the latest 
version of each row. To drop duplicates, I currently convert the append-only 
table to a pandas DataFrame, drop the duplicates there, convert the result 
back to a table, and save it as a separate "latest-version" dataset.

{code:python}
# Proposed API: drop_duplicates() does not exist on pyarrow.Table yet.
table.sort_by(
    sorting=[("id", "ascending"), ("updated_at", "ascending")]
).drop_duplicates(subset=["id"], keep="last")
{code}








--
This message was sent by Atlassian Jira
(v8.20.1#820001)
