[ 
https://issues.apache.org/jira/browse/ARROW-14183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511299#comment-17511299
 ] 

Antoine Pitrou commented on ARROW-14183:
----------------------------------------

Well, K should most of the time be a lot smaller than the block length (if K is 
typically 10 or 100, for example).

Another idea for the RecordBatch K-top (the first step of a Table K-top, before 
the K-merge) is to avoid the cost of multi-column compares by doing a sort of 
Radix Heap.

Do a (K+1)-top on the current (first) batch column. There are two possible 
outcomes:
 * if top[K-1] != top[K], then you don't need to handle other columns since 
there is no tie
 * if top[K-1] == top[K], we need to recurse into the next column by:
1) finding the first J such that top[J]==top[K]
2) finding all indices in the current column such that the value is equal to 
top[K]
3) running a (K-J+1)-top on the next batch column with the row indices found in 
step 2) above

> [C++] Improve select_k_unstable performance
> -------------------------------------------
>
>                 Key: ARROW-14183
>                 URL: https://issues.apache.org/jira/browse/ARROW-14183
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: David Li
>            Assignee: Alvin Chunga Mamani
>            Priority: Major
>              Labels: kernel, pull-request-available
>         Attachments: image-2022-02-16-02-35-42-818.png
>
>          Time Spent: 7h 10m
>  Remaining Estimate: 0h
>
> Recently {{sort_indices}} was optimized and refactored in ARROW-14165. One of 
> those optimizations could also be applied to {{select_k_unstable}}'s 
> implementation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

Reply via email to