[ 
https://issues.apache.org/jira/browse/ARROW-14183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17495517#comment-17495517
 ] 

Antoine Pitrou commented on ARROW-14183:
----------------------------------------

So, first, I think you can assume that K will be small compared to N (for 
example K=100 vs. N several millions). You probably should confirm this by 
asking the kernels team.

Maintaining a K-element max-heap is {{O(N log K)}}. You're unlikely to beat 
this by much if you switch to a sorting approach with complexity {{O(N log 
N)}}, even if the sort is more carefully optimized than the heap maintenance.

So I think you should try to explore this approach:
 * for each batch, create a K-element max-heap of the top (or bottom) elements 
in the batch; since the batch is contiguous, you will not be paying the cost of 
chunked index access
 * compute the final result from the K-element heaps: naively, we can create a 
new K-element max-heap and feed it from the existing onces (I don't know if 
there are better algorithms to exploit the fact that the input is already a 
bunch of max-heaps; but that's a second order concern IMHO)

> [C++] Improve select_k_unstable performance
> -------------------------------------------
>
>                 Key: ARROW-14183
>                 URL: https://issues.apache.org/jira/browse/ARROW-14183
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: David Li
>            Assignee: Alvin Chunga Mamani
>            Priority: Major
>              Labels: kernel, pull-request-available
>         Attachments: image-2022-02-16-02-35-42-818.png
>
>          Time Spent: 4h 40m
>  Remaining Estimate: 0h
>
> Recently {{sort_indices}} was optimized and refactored in ARROW-14165. One of 
> those optimizations could also be applied to {{select_k_unstable}}'s 
> implementation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to