jorisvandenbossche commented on issue #35748:
URL: https://github.com/apache/arrow/issues/35748#issuecomment-1570330670
We already have `pa.concat_arrays`, which gives a single (non-chunked)
array. However, it expects a list of Arrays and doesn't work with
ChunkedArrays. So to use it to concatenate a list of chunked arrays into a
single array, we need some more gymnastics to flatten the chunks. Currently:
```python
>>> merged = pa.concat_arrays([chunk for arr in [a1, a2] for chunk in arr.chunks])
>>> merged
<pyarrow.lib.Int64Array object at 0x7fe2e69710c0>
[
1,
2,
3,
6,
7,
4,
7,
8
]
```
We should maybe update `pa.concat_arrays` to also accept ChunkedArrays.
Of course, concatenating doesn't remove duplicates. You can then get the
unique values of the merged array:
```python
>>> merged.unique()
<pyarrow.lib.Int64Array object at 0x7fe2e6bd12a0>
[
1,
2,
3,
6,
7,
4,
8
]
```
But for larger arrays, it might be more efficient to compute the unique values
before actually concatenating, since we can also calculate the unique values
directly for a ChunkedArray. We can convert the list of chunked arrays into one
chunked array (which is zero-copy), and then get the unique values of that:
```python
>>> merged_chunked = pa.chunked_array([chunk for arr in [a1, a2] for chunk in arr.chunks])
>>> merged_chunked.unique()
<pyarrow.lib.Int64Array object at 0x7fe2e692f880>
[
1,
2,
3,
6,
7,
4,
8
]
```