jorisvandenbossche commented on issue #35748:
URL: https://github.com/apache/arrow/issues/35748#issuecomment-1578291747
Yes, but what you call "merge" is essentially a "concat (or combine into a chunked array) + unique + sort"?
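(The definitions of `a1` and `a2` come from earlier in the thread; for the example below to be reproducible, assume inputs along these illustrative lines:)
```
>>> import pyarrow as pa
>>> a1 = pa.chunked_array([[1, 2, 3], [4, 6]])  # illustrative values only
>>> a2 = pa.chunked_array([[2, 3, 7], [8, 1]])
```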
```
>>> merged_chunked = pa.chunked_array(
...     [chunk for arr in [a1, a2] for chunk in arr.chunks]
... )
>>> merged_chunked.unique().sort()
<pyarrow.lib.Int64Array object at 0x7f79b66b3520>
[
  1,
  2,
  3,
  4,
  6,
  7,
  8
]
```
And the question is then whether this can be done more efficiently than the combination of those existing kernels, if you know the input is already unique and/or already sorted (or whether it is worth adding a helper function that does this combination for you, as a convenience).
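If such a helper were added, it could start out as a thin wrapper around the existing kernels. A minimal sketch (the name `union_unique_sorted` is invented here, not part of pyarrow; it assumes the inputs are `ChunkedArray`s and a pyarrow version where `ChunkedArray.unique()` and `Array.sort()` are available):
```
import pyarrow as pa

def union_unique_sorted(*arrays):
    # Hypothetical convenience helper, not a pyarrow API: gather the
    # chunks of all input ChunkedArrays into a single ChunkedArray,
    # then deduplicate and sort using the existing kernels.
    chunks = [chunk for arr in arrays for chunk in arr.chunks]
    return pa.chunked_array(chunks).unique().sort()
```
A dedicated kernel could improve on this when the inputs are known to be sorted and unique, e.g. by doing a k-way merge instead of rehashing and re-sorting everything.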
As I mentioned in the Iceberg PR, numpy has a set of "set operation" functions for arrays (union, intersection, difference). I _think_ this case of "merging" arrays is a union set operation. In numpy, this is `np.union1d` (but just for 2 input arrays), whose docstring says "Return the unique, sorted array of values that are in either of the two input arrays". Note, though, that in numpy this function is actually implemented as just `np.unique(np.concatenate((arr1, arr2)))` under the hood (`np.unique` already sorts).
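To illustrate (this mirrors the example in the `np.union1d` docstring):
```
>>> import numpy as np
>>> np.union1d([1, 3, 4, 3], [3, 1, 2, 1])
array([1, 2, 3, 4])
>>> # equivalent to what union1d does internally:
>>> np.unique(np.concatenate([[1, 3, 4, 3], [3, 1, 2, 1]]))
array([1, 2, 3, 4])
```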
One of the other set operations, `np.intersect1d`, has a keyword `assume_unique=False/True` with which you can indicate that you know the input arrays are already unique, enabling a faster implementation (i.e. avoiding a call to `unique` on each input array).
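For example (illustrative values; passing `assume_unique=True` is only safe if the inputs really are deduplicated):
```
>>> import numpy as np
>>> np.intersect1d([1, 2, 3, 4], [2, 4, 6], assume_unique=True)
array([2, 4])
```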