jorisvandenbossche commented on issue #35748:
URL: https://github.com/apache/arrow/issues/35748#issuecomment-1570330670

   We already have `pa.concat_arrays`, which gives a single (non-chunked) 
array. However, it expects a list of Arrays and doesn't work with 
ChunkedArrays. So to use it to concatenate a list of chunked arrays into a 
single one, we need some more gymnastics to flatten the chunks first. Currently:
   
   ```python
   >>> merged = pa.concat_arrays([chunk for arr in [a1, a2] for chunk in arr.chunks])
   >>> merged
   <pyarrow.lib.Int64Array object at 0x7fe2e69710c0>
   [
     1,
     2,
     3,
     6,
     7,
     4,
     7,
     8
   ]
   ```
   
   We should maybe update `pa.concat_arrays` to also accept ChunkedArrays.
   
   Of course, that doesn't make them unique. You can then get the unique values 
of the merged array:
   
   ```python
   >>> merged.unique()
   <pyarrow.lib.Int64Array object at 0x7fe2e6bd12a0>
   [
     1,
     2,
     3,
     6,
     7,
     4,
     8
   ]
   ```
   
   But for larger arrays, it might be more efficient to get the uniques first, 
before actually concatenating, since we can also calculate the unique values 
directly for a ChunkedArray. To do that, convert the list of chunked arrays 
into one chunked array (which is zero-copy), and then get the uniques of that:
   
   ```python
   >>> merged_chunked = pa.chunked_array([chunk for arr in [a1, a2] for chunk in arr.chunks])
   >>> merged_chunked.unique()
   <pyarrow.lib.Int64Array object at 0x7fe2e692f880>
   [
     1,
     2,
     3,
     6,
     7,
     4,
     8
   ]
   ```

