westonpace commented on issue #35748: URL: https://github.com/apache/arrow/issues/35748#issuecomment-1585236382
I don't think we document this anywhere but the sort_indices function for chunked arrays works by first sorting the individual arrays and then merging. So you can get part of the way there by combining all the chunks and using `pyarrow.compute.sort`. This still means the individual chunks will be "resorted". At the end of the day we use std::stable_sort for the sorting. I don't know if it is optimized to run quickly on already-sorted arrays but SO seems to suggest it's [not terrible](https://stackoverflow.com/questions/6567326/does-stdsort-check-if-a-vector-is-already-sorted) (though I'm not sure if std::stable_sort will behave the same). Unfortunately, I don't know of any compute function or clever trick to handle the uniqueness part. It would be nice to combine all of these into a single compute vector function that works on chunked arrays if someone wanted to. ``` import pyarrow as pa import pyarrow.compute as pc a1 = pa.chunked_array([[1, 2, 3], [6, 7]]) a2 = pa.chunked_array([[4, 7, 8]]) all_chunks = pa.chunked_array(a1.chunks + a2.chunks) # Cheap sort_indices = pc.sort_indices(all_chunks) # Could be cheaper, but not terrible, probably not worse than O(N*log(N)) sorted_arr = pc.take(all_chunks, sort_indices) # Should be O(N), probably as fast or faster than concat sorted_uniq_arr = pc.unique(sorted_arr) # Unfortunate we have to do this step print(sorted_uniq_arr) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
