[
https://issues.apache.org/jira/browse/ARROW-14183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17511299#comment-17511299
]
Antoine Pitrou commented on ARROW-14183:
----------------------------------------
Well, K should most of the time be a lot smaller than the block length (if K is
typically 10 or 100, for example).
Another idea for the RecordBatch K-top (the first step of a Table K-top, before
the K-merge) is to avoid the cost of multi-column compares by doing a sort of
Radix Heap.
Do a (K+1)-top on the current (first) batch column. There are two possible
outcomes:
* if top[K-1] != top[K], then you don't need to handle other columns since
there is no tie
* if top[K-1] == top[K], we need to recurse into the next column by:
1) finding the first J such that top[J]==top[K]
2) finding all indices in the current column such that the value is equal to
top[K]
3) running a (K-J+1)-top on the next batch column with the row indices found in
step 2) above
> [C++] Improve select_k_unstable performance
> -------------------------------------------
>
> Key: ARROW-14183
> URL: https://issues.apache.org/jira/browse/ARROW-14183
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: David Li
> Assignee: Alvin Chunga Mamani
> Priority: Major
> Labels: kernel, pull-request-available
> Attachments: image-2022-02-16-02-35-42-818.png
>
> Time Spent: 7h 10m
> Remaining Estimate: 0h
>
> Recently {{sort_indices}} was optimized and refactored in ARROW-14165. One of
> those optimizations could also be applied to {{select_k_unstable}}'s
> implementation.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)