[GitHub] [arrow] westonpace commented on issue #35748: [Python] Implement efficient merging of chunked arrays

via GitHub Fri, 09 Jun 2023 16:23:32 -0700


westonpace commented on issue #35748:
URL: https://github.com/apache/arrow/issues/35748#issuecomment-1585236382


   I don't think we document this anywhere but the sort_indices function for 
chunked arrays works by first sorting the individual arrays and then merging.  
So you can get part of the way there by combining all the chunks and using 
`pyarrow.compute.sort`.  This still means the individual chunks will be 
"resorted".  At the end of the day we use std::stable_sort for the sorting.  I 
don't know if it is optimized to run quickly on already-sorted arrays but SO 
seems to suggest it's [not 
terrible](https://stackoverflow.com/questions/6567326/does-stdsort-check-if-a-vector-is-already-sorted)
 (though I'm not sure if std::stable_sort will behave the same).
   
   Unfortunately, I don't know of any compute function or clever trick to 
handle the uniqueness part.
   
   It would be nice to combine all of these into a single compute vector 
function that works on chunked arrays if someone wanted to.
   
   ```
   import pyarrow as pa
   import pyarrow.compute as pc
   
   a1 = pa.chunked_array([[1, 2, 3], [6, 7]])
   a2 = pa.chunked_array([[4, 7, 8]])
   
   all_chunks = pa.chunked_array(a1.chunks + a2.chunks) # Cheap                 
                                                                                
                                                      
   sort_indices = pc.sort_indices(all_chunks) # Could be cheaper, but not 
terrible, probably not worse than O(N*log(N))                                   
                                                            
   sorted_arr = pc.take(all_chunks, sort_indices) # Should be O(N), probably as 
fast or faster than concat                                                      
                                                      
   sorted_uniq_arr = pc.unique(sorted_arr) # Unfortunate we have to do this 
step                                                                            
                                                          
   
   print(sorted_uniq_arr)
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] westonpace commented on issue #35748: [Python] Implement efficient merging of chunked arrays

Reply via email to