jorisvandenbossche commented on issue #35748:
URL: https://github.com/apache/arrow/issues/35748#issuecomment-1765974158

   So whenever there is a duplicate in the first field, you keep the last 
occurrence (i.e. use the value of the second field from that last 
occurrence)?
   
   In pandas this can be done with the `DataFrame.drop_duplicates()` method, 
where you can specify which subset of columns to consider when determining 
duplicate rows, and which duplicate to keep in the result (first or last).
   
   We have an issue requesting this feature, see 
https://github.com/apache/arrow/issues/30950. There is some discussion there 
about potential workarounds. 
   
   I think one of the ideas mentioned there actually works nowadays: group by 
the key(s) you want to deduplicate on, and then aggregate the remaining 
columns with the "last" aggregation.
   
   ```
   In [13]: import pyarrow as pa
   
   In [14]: batch1 = pa.RecordBatch.from_struct_array(pa.array([(1, "1"), (2, "2")], pa.struct([("a", "int64"), ("b", "string")])))
   
   In [15]: batch2 = pa.RecordBatch.from_struct_array(pa.array([(1, "3"), (3, "4")], pa.struct([("a", "int64"), ("b", "string")])))
   
   In [16]: table = pa.Table.from_batches([batch1, batch2])
   
   In [17]: table
   Out[17]: 
   pyarrow.Table
   a: int64
   b: string
   ----
   a: [[1,2],[1,3]]
   b: [["1","2"],["3","4"]]
   
   In [18]: table.group_by("a", use_threads=False).aggregate([("b", "last")])
   Out[18]: 
   pyarrow.Table
   a: int64
   b_last: string
   ----
   a: [[1,2,3]]
   b_last: [["3","2","4"]]
   ```
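   
   Note the `use_threads=False`: with multithreaded execution `group_by` gives no stable ordering guarantee, so the order-sensitive "last" aggregation could otherwise pick a nondeterministic row per group.
   
   Until https://github.com/apache/arrow/issues/30950 is implemented, another workaround is a round trip through pandas, which also preserves the original row order of the kept rows (a sketch, assuming the table fits in memory; `drop_duplicates_last` is just an illustrative helper name, not a pyarrow API):
   
   ```
   import pyarrow as pa

   # Illustrative helper (not a pyarrow API): deduplicate on the given
   # key columns, keeping the last occurrence, via a pandas round trip.
   def drop_duplicates_last(table: pa.Table, keys: list[str]) -> pa.Table:
       df = table.to_pandas().drop_duplicates(subset=keys, keep="last")
       return pa.Table.from_pandas(df, preserve_index=False)

   # drop_duplicates_last(table, ["a"]) keeps rows (2, "2"), (1, "3"), (3, "4").
   ```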
   

