pchintar commented on code in PR #9991:
URL: https://github.com/apache/arrow-rs/pull/9991#discussion_r3492694279
##########
arrow-ord/src/sort.rs:
##########
@@ -1000,6 +1016,90 @@ fn sort_fixed_column<const N: usize>(
Ok(())
}
+// Uses the fixed-column comparator for the bounded heap path.
+fn lexsort_topk_fixed<const N: usize>(
+ columns: &[SortColumn],
+ row_count: usize,
+ limit: usize,
+) -> Result<Vec<usize>, ArrowError> {
+ let lexicographical_comparator =
FixedLexicographicalComparator::<N>::try_new(columns)?;
+ Ok(lexsort_topk(row_count, limit, |a, b| {
+ lexicographical_comparator.compare(a, b)
+ }))
+}
+
+// Keeps the smallest `limit` indices in a bounded max-heap.
+// The root is the largest retained index according to `compare`.
+fn lexsort_topk(
+ row_count: usize,
+ limit: usize,
+ mut compare: impl FnMut(usize, usize) -> Ordering,
+) -> Vec<usize> {
+ let mut heap = Vec::with_capacity(limit);
+
+ for idx in 0..row_count {
+ if heap.len() < limit {
+ heap.push(idx);
+ let pos = heap.len() - 1;
+ sift_up_worst_heap(&mut heap, pos, &mut compare);
+ } else if compare(idx, heap[0]) == Ordering::Less {
+ heap[0] = idx;
+ sift_down_worst_heap(&mut heap, 0, &mut compare);
+ }
+ }
+
+ heap.sort_unstable_by(|a, b| compare(*a, *b));
+ heap
+}
+
+// Moves a newly inserted index toward the root while it is larger than its
parent.
Review Comment:
@alamb That's an interesting suggestion, thanks for pointing it out.
I took a look at the loser tree implementation in DataFusion. My
understanding is that it's designed for k-way merging of already sorted
streams, whereas this change is optimizing bounded Top-K selection over a
single unsorted input by maintaining the current best `limit` rows during a
single scan.
For this PR, I chose the bounded heap approach because it fits the existing
`lexsort_to_indices` implementation with a relatively small, self-contained
change. Reworking this to use the loser tree would likely involve pulling that
implementation into arrow-rs and adapting it for a different use case, which
felt like a substantially larger change than the scope of this optimization.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]