[
https://issues.apache.org/jira/browse/ARROW-14183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17495517#comment-17495517
]
Antoine Pitrou commented on ARROW-14183:
----------------------------------------
So, first, I think you can assume that K will be small compared to N (for
example K=100 vs. N several millions). You probably should confirm this by
asking the kernels team.
Maintaining a K-element max-heap is {{O(N log K)}}. You're unlikely to beat
this by much if you switch to a sorting approach with complexity {{O(N log
N)}}, even if the sort is more carefully optimized than the heap maintenance.
So I think you should try to explore this approach:
* for each batch, create a K-element max-heap of the top (or bottom) elements
in the batch; since the batch is contiguous, you will not be paying the cost of
chunked index access
* compute the final result from the K-element heaps: naively, we can create a
new K-element max-heap and feed it from the existing onces (I don't know if
there are better algorithms to exploit the fact that the input is already a
bunch of max-heaps; but that's a second order concern IMHO)
> [C++] Improve select_k_unstable performance
> -------------------------------------------
>
> Key: ARROW-14183
> URL: https://issues.apache.org/jira/browse/ARROW-14183
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: David Li
> Assignee: Alvin Chunga Mamani
> Priority: Major
> Labels: kernel, pull-request-available
> Attachments: image-2022-02-16-02-35-42-818.png
>
> Time Spent: 4h 40m
> Remaining Estimate: 0h
>
> Recently {{sort_indices}} was optimized and refactored in ARROW-14165. One of
> those optimizations could also be applied to {{select_k_unstable}}'s
> implementation.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)