jorisvandenbossche commented on issue #35748: URL: https://github.com/apache/arrow/issues/35748#issuecomment-1765974158
So whenever there is a duplicate in the first field, you take the last occurrence of that one (i.e. use the value of the second field of the last occurrence)? In pandas this can be done with the `DataFrame.drop_duplicates()` method, where you can specify which subset of columns to consider for determining duplicate rows, and which duplicate to keep in the result (first/last); see the short pandas sketch after the example below.

We have an issue requesting this feature, see https://github.com/apache/arrow/issues/30950. There is some discussion there about potential workarounds. I think one of the ideas mentioned there actually works nowadays: group by the key(s) you want to deduplicate on, and then aggregate the remaining columns with the "last" aggregation.

```
In [14]: batch1 = pa.RecordBatch.from_struct_array(pa.array([(1, "1"), (2, "2")], pa.struct([("a", "int64"), ("b", "string")])))

In [15]: batch2 = pa.RecordBatch.from_struct_array(pa.array([(1, "3"), (3, "4")], pa.struct([("a", "int64"), ("b", "string")])))

In [16]: table = pa.Table.from_batches([batch1, batch2])

In [17]: table
Out[17]:
pyarrow.Table
a: int64
b: string
----
a: [[1,2],[1,3]]
b: [["1","2"],["3","4"]]

In [18]: table.group_by("a", use_threads=False).aggregate([("b", "last")])
Out[18]:
pyarrow.Table
a: int64
b_last: string
----
a: [[1,2,3]]
b_last: [["3","2","4"]]
```

Note that `use_threads=False` matters here: with threading enabled, `group_by` does not guarantee a stable ordering of the rows, so "last" would not reliably refer to the last occurrence.
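For comparison, here is a minimal sketch of the pandas route mentioned above, converting the pyarrow Table to pandas and back. The column names and sample data are just illustrative, not taken from the issue:

```
import pyarrow as pa

table = pa.table({"a": [1, 2, 1, 3], "b": ["1", "2", "3", "4"]})

# Consider only column "a" when detecting duplicate rows, and keep
# the last occurrence of each duplicated key.
deduplicated = table.to_pandas().drop_duplicates(subset=["a"], keep="last")

# Back to a pyarrow Table, dropping the pandas index.
result = pa.Table.from_pandas(deduplicated, preserve_index=False)
```

One difference to be aware of: `drop_duplicates` keeps the surviving rows in their original positions, so the resulting row order can differ from the `group_by` output above.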
