jorisvandenbossche commented on issue #35748:
URL: https://github.com/apache/arrow/issues/35748#issuecomment-1578291747
Yes, but what you call "merge" is essentially a "concat (or combine into a chunked array) + unique + sort"?
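(The definitions of `a1` and `a2` come from earlier in the thread; for the example below to be reproducible, assume inputs along these illustrative lines:)
```
>>> import pyarrow as pa
>>> a1 = pa.chunked_array([[1, 2, 3], [4, 6]])  # illustrative values only
>>> a2 = pa.chunked_array([[2, 3, 7], [8, 1]])
```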
```
>>> merged_chunked = pa.chunked_array(
...     [chunk for arr in [a1, a2] for chunk in arr.chunks]
... )
>>> merged_chunked.unique().sort()
<pyarrow.lib.Int64Array object at 0x7f79b66b3520>
[
  1,
  2,
  3,
  4,
  6,
  7,
  8
]
```
And the question is then whether this can be done more efficiently than the combination of those existing kernels, if you know the input is already unique and/or already sorted (or whether it is worth adding a helper function that does this combination for you, as a convenience).
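If such a helper were added, it could start out as a thin wrapper around the existing kernels. A minimal sketch (the name `union_unique_sorted` is invented here, not part of pyarrow; it assumes the inputs are `ChunkedArray`s and a pyarrow version where `ChunkedArray.unique()` and `Array.sort()` are available):
```
import pyarrow as pa

def union_unique_sorted(*arrays):
    # Hypothetical convenience helper, not a pyarrow API: gather the
    # chunks of all input ChunkedArrays into a single ChunkedArray,
    # then deduplicate and sort using the existing kernels.
    chunks = [chunk for arr in arrays for chunk in arr.chunks]
    return pa.chunked_array(chunks).unique().sort()
```
A dedicated kernel could improve on this when the inputs are known to be sorted and unique, e.g. by doing a k-way merge instead of rehashing and re-sorting everything.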
As I mentioned in the Iceberg PR, numpy has a set of "set operation" functions for arrays (union, intersection, difference). I _think_ this case of "merging" arrays is a union set operation. In numpy, this is `np.union1d` (but just for 2 input arrays), whose docstring says "Return the unique, sorted array of values that are in either of the two input arrays". Note, though, that in numpy this function is actually implemented as just `np.unique(np.concatenate((arr1, arr2)))` under the hood (`np.unique` already sorts).
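To illustrate (this mirrors the example in the `np.union1d` docstring):
```
>>> import numpy as np
>>> np.union1d([1, 3, 4, 3], [3, 1, 2, 1])
array([1, 2, 3, 4])
>>> # equivalent to what union1d does internally:
>>> np.unique(np.concatenate([[1, 3, 4, 3], [3, 1, 2, 1]]))
array([1, 2, 3, 4])
```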
One of the other set operations, `np.intersect1d`, has a keyword `assume_unique=False/True` with which you can indicate that you know the input arrays are already unique, enabling a faster implementation (i.e. avoiding a call to `unique` on each input array).
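For example (illustrative values; passing `assume_unique=True` is only safe if the inputs really are deduplicated):
```
>>> import numpy as np
>>> np.intersect1d([1, 2, 3, 4], [2, 4, 6], assume_unique=True)
array([2, 4])
```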